Memory-Mapped I/O, Interrupts, and DMA

What This Concept Is

A CPU reaches the outside world through three cooperating mechanisms:

Memory-mapped I/O (MMIO) -- device registers appear as physical addresses. A load or store to those addresses is routed to the device controller instead of DRAM. There is no separate "I/O instruction" on most modern ISAs (x86_64 still has in/out for legacy ports, but drivers mostly use MMIO).
Interrupts -- when a device has news (bytes arrived, a disk completed a write, a timer expired), it signals the CPU: a dedicated interrupt line on older buses, or a message-signaled interrupt (MSI/MSI-X, delivered as a small memory write) on PCIe. The CPU saves a minimal amount of architectural state, jumps to an interrupt handler in the kernel, handles the event, and returns. This is the main mechanism that lets the CPU avoid polling idle devices.
DMA (direct memory access) -- a device-side engine transfers blocks of data between device memory and main memory without CPU involvement. The CPU programs a descriptor (source, destination, length), kicks off the transfer, and is interrupted when done. The CPU remains free during the transfer.

Together these give the modern "fire-and-forget" I/O pattern: program a DMA request, sleep the thread, wake on the interrupt. The CPU is not burning cycles waiting on a disk.
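
The pattern can be sketched on a host machine with a simulated engine. Everything here is hypothetical: the descriptor layout, fake_dma_engine_run, and on_dma_complete are illustrative names, and the "device" is a plain memcpy that runs inline rather than concurrently as real hardware would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor layout; real devices define their own. */
struct dma_desc {
    const void  *src;   /* source address */
    void        *dst;   /* destination address */
    size_t       len;   /* bytes to transfer */
    volatile int done;  /* set by the "device" on completion */
};

/* Completion callback, standing in for the interrupt handler. */
typedef void (*dma_irq_fn)(struct dma_desc *d);

/* Stand-in for the device-side engine. On real hardware this copy runs
 * in the device while the CPU does other work; here it runs inline. */
static void fake_dma_engine_run(struct dma_desc *d, dma_irq_fn irq)
{
    memcpy(d->dst, d->src, d->len);
    d->done = 1;
    if (irq)
        irq(d);         /* "raise the interrupt" */
}

static int irq_count;   /* completions seen by the handler */

static void on_dma_complete(struct dma_desc *d)
{
    assert(d->done);
    irq_count++;        /* a real handler would wake the sleeping thread */
}

/* Program a descriptor (source, destination, length) and kick it off. */
static void dma_submit(struct dma_desc *d, void *dst, const void *src, size_t len)
{
    d->src = src;
    d->dst = dst;
    d->len = len;
    d->done = 0;
    fake_dma_engine_run(d, on_dma_complete);
}
```

In a real driver the submit step would be an MMIO write to a doorbell register, and the thread would sleep between dma_submit and the completion interrupt.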

Why It Matters Here

This cluster is about how the machine extends past its register file and caches. A systems programmer meets these mechanisms when:

writing a driver or using /dev nodes backed by MMIO
reasoning about latency in a kernel path: interrupt delivery, bottom halves, softirqs
explaining why read() on a file often returns without waiting on the disk -- readahead has already filled the page cache by DMA
understanding why tight polling loops in user space are usually an antipattern: they keep the core out of low-power idle states, burn power and heat the die, and take execution resources from a sibling hardware thread

It is a supporting concept at this module's level; you will revisit it when studying operating systems in Module 4 and Semester 5.

Concrete Example

A typical NIC receive path on Linux:

The driver sets up a ring of receive descriptors in DRAM, each pointing at a free 2 KiB buffer.
The NIC DMAs incoming packet data into those buffers as they arrive.
When a batch has arrived, the NIC raises an interrupt.
The CPU enters the interrupt handler, which ACKs the interrupt and schedules a softirq (NAPI) to process packets.
The softirq reads descriptors, passes packets up the stack, and returns buffers to the ring.

The CPU copies no packet bytes during the DMA itself; it only reads descriptors and headers afterward. That is a large part of why 10 GbE is tractable on a single core.
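
The receive path above can be modelled on a host as a small ring with a producer (the NIC) and a budgeted consumer (the softirq poll). This is a sketch under stated assumptions: RING_SIZE, the descriptor fields, and the function names are invented for illustration and do not correspond to any real driver's layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 8      /* assumed; real rings are usually much larger */
#define BUF_SIZE  2048   /* 2 KiB buffers, as in the text */

struct rx_desc {
    uint8_t *buf;        /* buffer the NIC may DMA into */
    uint16_t len;        /* filled in by the NIC */
    uint8_t  ready;      /* 1 = NIC has written a packet here */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint8_t  bufs[RING_SIZE][BUF_SIZE];
    unsigned nic_head;   /* next slot the NIC writes */
    unsigned drv_tail;   /* next slot the driver reads */
};

/* Driver setup: point every descriptor at a free buffer. */
static void ring_init(struct rx_ring *r)
{
    memset(r, 0, sizeof *r);
    for (unsigned i = 0; i < RING_SIZE; i++)
        r->desc[i].buf = r->bufs[i];
}

/* Stand-in for the NIC: "DMA" one packet into the next free slot.
 * Returns 0 on success, -1 if the ring is full (packet dropped). */
static int nic_receive(struct rx_ring *r, const void *pkt, uint16_t len)
{
    struct rx_desc *d = &r->desc[r->nic_head];
    assert(len <= BUF_SIZE);
    if (d->ready)
        return -1;
    memcpy(d->buf, pkt, len);
    d->len = len;
    d->ready = 1;
    r->nic_head = (r->nic_head + 1) % RING_SIZE;
    return 0;
}

/* Stand-in for the softirq poll: process up to `budget` packets and
 * hand each buffer back to the ring. Returns packets processed. */
static int driver_poll(struct rx_ring *r, int budget, unsigned *bytes_seen)
{
    int done = 0;
    while (done < budget && r->desc[r->drv_tail].ready) {
        struct rx_desc *d = &r->desc[r->drv_tail];
        *bytes_seen += d->len;   /* "pass the packet up the stack" */
        d->ready = 0;            /* return the buffer to the NIC */
        r->drv_tail = (r->drv_tail + 1) % RING_SIZE;
        done++;
    }
    return done;
}
```

The budget argument mirrors the idea behind NAPI: one poll handles a bounded batch, so a busy NIC cannot monopolize the core.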

A simpler MMIO example -- a memory-mapped UART:

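#include <stdint.h>

// Addresses are platform-specific. Offset 5 and bit 0x20 follow the
// 16550-style line-status layout (TX holding register empty).
static volatile uint8_t *const UART_DATA   = (volatile uint8_t *)0x10000000;
static volatile uint8_t *const UART_STATUS = (volatile uint8_t *)0x10000005;

// Named uart_putc to avoid clashing with putc from <stdio.h>.
void uart_putc(char c) {
    while (!(*UART_STATUS & 0x20)) { }   // spin until the TX register is empty
    *UART_DATA = (uint8_t)c;             // write byte; hardware sends it
}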

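Here the load and the store travel through the memory system, but address decoding routes them to the device controller instead of DRAM. volatile tells the compiler that every access has a side effect, so it must not keep the status value in a register, drop the store, or reorder the two volatile accesses. It does not order accesses at the hardware level; that comes from the memory type of the mapping and, where needed, explicit barriers.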
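
A variant of the UART example that can be exercised on a host is sketched below. It takes the register base as a parameter, so a test can pass an ordinary array in place of the device, and it gives up after a bounded number of polls instead of spinning forever. The helper names (mmio_read8, mmio_write8, uart_putc_bounded) and the UART_TX_EMPTY macro are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets follow the UART example above; the macro names are invented. */
#define UART_DATA_OFF    0u
#define UART_STATUS_OFF  5u
#define UART_TX_EMPTY    0x20u

/* Every access goes through a volatile pointer, so the compiler must
 * perform each load and store exactly as written. */
static inline uint8_t mmio_read8(volatile uint8_t *base, uint32_t off)
{
    return base[off];
}

static inline void mmio_write8(volatile uint8_t *base, uint32_t off, uint8_t v)
{
    base[off] = v;
}

/* Send one byte. Returns 0 on success, or -1 if the transmitter is
 * still busy after max_spins additional polls of the status register. */
static int uart_putc_bounded(volatile uint8_t *base, char c, unsigned max_spins)
{
    assert(base != 0);
    while (!(mmio_read8(base, UART_STATUS_OFF) & UART_TX_EMPTY)) {
        if (max_spins-- == 0)
            return -1;
    }
    mmio_write8(base, UART_DATA_OFF, (uint8_t)c);
    return 0;
}
```

Bounding the spin is a design choice, not a requirement: it keeps a wedged device from hanging the caller, at the price of an error path to handle.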

Common Confusion / Misconception

"Interrupts are fast." They are not -- a single interrupt costs hundreds to thousands of cycles (save state, switch privilege level, run the handler with cold caches). At high packet rates the per-packet interrupt cost dominates, which is why NICs moved to interrupt coalescing and NAPI-style polling in softirq context. The right pattern depends on load: interrupts when traffic is light, polling when it is heavy.
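
A back-of-envelope calculation shows the scale of the problem. The sketch below assumes minimum-size 64-byte Ethernet frames (each also occupies 20 bytes of preamble and inter-frame gap on the wire) and an assumed cost of 1000 cycles per interrupt; both function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Packets per second at line rate. Each frame costs frame_bytes plus
 * 20 bytes of preamble and inter-frame gap on the wire. */
static uint64_t packets_per_second(uint64_t link_bps, uint64_t frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bps / ((frame_bytes + 20) * 8);
}

/* Cycles per second spent on interrupt overhead alone, given how many
 * packets are delivered per interrupt and the cost of one interrupt. */
static uint64_t irq_cycles_per_second(uint64_t pps, uint64_t pkts_per_irq,
                                      uint64_t cycles_per_irq)
{
    assert(pkts_per_irq > 0);
    return (pps / pkts_per_irq) * cycles_per_irq;
}
```

For a 10 Gb/s link, packets_per_second(10000000000ULL, 64) gives 14,880,952 packets per second. With one interrupt per packet at 1000 cycles each, that is about 14.9 billion cycles per second, more than a single 3 GHz core has. Coalescing 64 packets per interrupt brings it down to about 233 million cycles per second, roughly 8% of that core.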

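Another trap: treating DMA as "copying without the CPU" while forgetting that DMA traffic crosses the same memory fabric and competes with the CPU for bandwidth. On platforms where inbound DMA allocates into the last-level cache (Intel DDIO is one example), a high-bandwidth transfer can also evict a CPU thread's working set. On platforms without cache-coherent DMA, the driver must additionally flush or invalidate cache lines around each transfer.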

How To Use It

When reasoning about I/O latency, map the steps above: device -> DMA -> interrupt -> handler -> user wake-up. Each transition has a cost.
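Use volatile accesses for MMIO in C/C++, never plain loads and stores, and prefer the platform's accessors where they exist (readl()/writel() in the Linux kernel), since they also supply the required barriers. std::atomic is designed for communication between threads, not for device registers, so do not rely on it here. The region must also be mapped with a device or uncacheable memory type; otherwise a load may be served from a stale cache line instead of reaching the device.
Understand your platform's memory-type attributes: "write-combining" for frame buffers, "uncacheable" for config space, "write-back" for DRAM. These determine whether accesses are cached, merged, or reordered before they reach the bus, and therefore how many transactions the device actually sees.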
In user space, prefer event-driven I/O (epoll, io_uring) over busy polling. You are asking the kernel to let the CPU sleep until the interrupt arrives.
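
A minimal sketch of the event-driven style follows. It uses POSIX poll() as a portable stand-in for epoll or io_uring; the idea is the same, in that the thread sleeps in the kernel until the descriptor becomes readable. The helper name wait_readable is an assumption for illustration.

```c
#include <assert.h>
#include <poll.h>

/* Sleep in the kernel until fd is readable or timeout_ms elapses.
 * Returns 1 if data is ready, 0 if not (timeout, or an event other
 * than POLLIN), and -1 on error. No cycles are burned while waiting. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n;

    assert(fd >= 0);
    n = poll(&p, 1, timeout_ms);
    if (n < 0)
        return -1;
    return (n > 0 && (p.revents & POLLIN)) ? 1 : 0;
}
```

A caller would follow a return of 1 with read(). Compare this with a loop that calls a non-blocking read() until it succeeds: the result is the same, but the loop keeps the core busy for the whole wait.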

Check Yourself

Why is MMIO preferred over dedicated I/O instructions on modern ISAs?
What does an interrupt save, and why is delivering one expensive?
How does DMA free the CPU during a large transfer?
Why must code touching an MMIO region use volatile?

Mini Drill or Application

Sketch the timeline of a read(fd, buf, 4096) from an NVMe file, under a cold page cache. Include (a) the syscall, (b) the NVMe submission queue write, (c) the NVMe DMA into the page cache, (d) the completion interrupt, (e) the copy-to-user, (f) the return to user space.

Estimate the time spent in each step. Where is the CPU actually idle? Which steps involve DMA rather than CPU copies?

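Now rerun the same exercise for a hot page cache read. The DMA step disappears; only the copy_to_user remains. If the user buffer is small, most of the cost is the syscall boundary rather than the copy itself. This is part of why io_uring helps: batched submission (or a kernel-side submission polling thread) amortizes or avoids the syscall, and registered buffers avoid re-pinning user pages on every request. A buffered read still copies from the page cache; removing the copy requires direct I/O.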

Where This Shows Up Next

Operating systems: interrupt handlers, bottom halves, softirqs, thread wake-up paths.
Networking: NIC ring buffers, NAPI, XDP, zero-copy send/receive.
Storage: NVMe submission/completion queues and polled completions.
Performance debugging: perf top and ftrace to attribute cost to interrupt context vs process context.
