
Device I/O: Drivers, Interrupts, DMA

What This Concept Is

Beneath the block layer is a device driver that talks to actual hardware through a specific protocol: memory-mapped registers, port I/O, interrupts, and DMA (direct memory access).

The canonical interaction:

  1. CPU writes command + buffer descriptor to device registers (MMIO).
     "read 8 sectors at LBA 12345, DMA into physical address 0xAB00"
  2. Device performs the transfer while CPU does other work.
     - HDD: moves head, reads sectors, DMAs bytes over PCIe / SATA.
     - NVMe: pulls the command from a submission queue in host RAM,
       executes, pushes a completion entry to a completion queue.
  3. Device signals completion via interrupt (a legacy interrupt line, or MSI/MSI-X on PCIe).
  4. CPU's interrupt handler acknowledges, reads the completion entry,
     wakes waiting thread / schedules continuation.

Three primitives are fundamental:

  • Programmed I/O (PIO): CPU moves bytes between registers and memory. Simple, slow, used on tiny devices.
  • Interrupts: device asserts a line; CPU runs handler. Trades polling for wakeup cost. Can overload the CPU at high event rates, which is why drivers use interrupt coalescing and, in Linux network drivers, NAPI.
  • DMA: device writes or reads host memory directly over the bus while CPU does other work. Essential for anything beyond very low transfer rates, because PIO spends CPU cycles on every word moved. Modern systems use an IOMMU to restrict what physical memory a device can touch.

  [ CPU ]                [ DRAM ]              [ Device ]
     |                       ^                      |
     |---- MMIO cmd ---------|--------------------->|  1. submit
     |                       |                      |
     |                       |<--- DMA data --------|  2. device -> memory
     |                       |                      |
     |<--------- interrupt --|----------------------|  3. signal done
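
The submit / DMA / interrupt cycle above can be sketched as a toy simulation. This is only an illustration: ToyDisk, its method names, and the use of a bytearray to stand in for DRAM are all hypothetical, and a real driver would write to mapped registers rather than call Python methods.

```python
class ToyDisk:
    """Hypothetical device: one command register, DMA into host memory, one IRQ."""
    SECTOR = 512

    def __init__(self, dram, irq_handler):
        self.dram = dram                # host memory shared with the device
        self.irq_handler = irq_handler  # called when the device "raises an interrupt"
        self.media = {}                 # lba -> 512 bytes; missing sectors read as zeros
        self.pending = None

    def mmio_write_command(self, lba, count, dma_addr):
        # Step 1: the CPU writes the command and buffer descriptor to the device.
        self.pending = (lba, count, dma_addr)

    def run(self):
        # Step 2: the device copies sectors straight into host memory (the DMA).
        lba, count, addr = self.pending
        for i in range(count):
            data = self.media.get(lba + i, bytes(self.SECTOR))
            start = addr + i * self.SECTOR
            self.dram[start:start + self.SECTOR] = data
        self.pending = None
        # Step 3: the device signals completion.
        self.irq_handler(lba, count)


dram = bytearray(0x10000)
completed = []
disk = ToyDisk(dram, lambda lba, count: completed.append((lba, count)))
disk.media[12345] = b"\xAA" * 512
disk.mmio_write_command(lba=12345, count=8, dma_addr=0xAB00)
disk.run()  # in hardware this happens concurrently with other CPU work
```

After run() returns, the data is already in dram at 0xAB00 and the handler has recorded the completion; the CPU never copied a byte itself.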

Why It Matters Here

This layer is where "the kernel writes a block" becomes "sectors land on a platter" or "NAND cells change state." The performance characteristics of the rest of this module (sequential vs random, fsync cost, page cache vs O_DIRECT) all trace back to device-level behavior.

Key phenomena:

  • Interrupt storm: a 10 Gb/s NIC delivering 1M pps cannot raise 1M interrupts per second without collapsing. Drivers coalesce and poll (NAPI, busy-poll) to avoid this.
  • DMA mapping: on systems with IOMMU, setting up DMA is non-trivial; drivers allocate pinned pages and translate addresses. This per-op cost is why large batched transfers dominate small ones.
  • Queue depth: the NVMe specification allows up to 64K I/O queues, each up to 64K entries deep (real devices typically expose far fewer). Parallelism at this layer is what lets io_uring saturate them.
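
A back-of-the-envelope calculation shows why the interrupt storm matters. The 2 microseconds of CPU time per interrupt used below is an assumed figure for illustration, not a measurement, and irq_cpu_fraction is a hypothetical helper.

```python
def irq_cpu_fraction(events_per_sec, events_per_irq, irq_cost_us):
    """Fraction of one core spent on interrupt entry/exit alone."""
    irqs_per_sec = events_per_sec / events_per_irq
    return irqs_per_sec * irq_cost_us / 1_000_000


# One interrupt per packet at 1M pps, assuming 2 us per interrupt:
per_packet = irq_cpu_fraction(1_000_000, events_per_irq=1, irq_cost_us=2.0)

# Coalescing 64 packets into each interrupt, same assumed cost:
coalesced = irq_cpu_fraction(1_000_000, events_per_irq=64, irq_cost_us=2.0)

print(f"per-packet: {per_packet:.2f} cores, coalesced: {coalesced:.4f} cores")
```

Under that assumption, per-packet interrupts would need two full cores just for interrupt overhead, while coalescing by 64 brings it down to about 3% of one core.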

Concrete Example

An NVMe read, low-level:

  1. Kernel writes a submission queue entry (SQE) into the next free slot of a per-queue ring buffer in host RAM.
  2. Kernel rings a "doorbell" register on the device (a single MMIO write to a specific address).
  3. Device DMAs the SQE from host RAM, parses the command.
  4. Device reads NAND, DMAs the data into the buffer specified by the SQE.
  5. Device writes a completion queue entry (CQE) into the host's completion queue.
  6. Device raises an MSI-X interrupt on a specific CPU.
  7. Kernel interrupt handler reads the CQE, matches it to a pending request, wakes the waiter.
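
The ring mechanics in these steps can be sketched in a few dozen lines. This is a simplified model, not the NVMe wire format: ToyNvmeQueuePair and its fields are hypothetical, the data DMA of step 4 is omitted, the device runs synchronously inside the doorbell write, and the completion-queue head doorbell that a real host also writes is left out. The phase bit is how the host tells a fresh CQE from a stale one left over from the previous lap of the ring.

```python
from dataclasses import dataclass


@dataclass
class SQE:
    cid: int       # command identifier chosen by the host
    lba: int
    nblocks: int


@dataclass
class CQE:
    cid: int
    status: int    # 0 means success in this toy model
    phase: int     # flips each time the device wraps the completion ring


class ToyNvmeQueuePair:
    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth
        self.cq = [CQE(cid=0, status=0, phase=0) for _ in range(depth)]
        self.sq_tail = self.sq_head = 0   # tail: host-owned, head: device-owned
        self.cq_tail = self.cq_head = 0   # tail: device-owned, head: host-owned
        self.dev_phase = self.host_phase = 1
        self.inflight = 0
        self.doorbell_writes = 0

    def submit(self, sqe):
        # Step 1: write the SQE into the ring in host RAM (no device access yet).
        if self.inflight >= self.depth - 1:
            raise BufferError("queue pair full; reap completions first")
        self.sq[self.sq_tail] = sqe
        self.sq_tail = (self.sq_tail + 1) % self.depth
        self.inflight += 1

    def ring_doorbell(self):
        # Step 2: one MMIO write publishes the new tail, covering every SQE queued so far.
        self.doorbell_writes += 1
        self._device_process(self.sq_tail)

    def _device_process(self, new_tail):
        # Steps 3 and 5: the device fetches each SQE and posts a CQE.
        while self.sq_head != new_tail:
            sqe = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = CQE(sqe.cid, 0, self.dev_phase)
            self.cq_tail = (self.cq_tail + 1) % self.depth
            if self.cq_tail == 0:
                self.dev_phase ^= 1

    def reap(self):
        # Step 7: the interrupt handler consumes every CQE whose phase is current.
        done = []
        while self.cq[self.cq_head].phase == self.host_phase:
            done.append(self.cq[self.cq_head].cid)
            self.cq_head = (self.cq_head + 1) % self.depth
            if self.cq_head == 0:
                self.host_phase ^= 1
            self.inflight -= 1
        return done
```

Queuing several SQEs and then calling ring_doorbell() once is the batching that keeps the per-request MMIO cost low; reap() likewise drains all ready completions in one handler invocation.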

Because submissions and completions are shared-memory ring buffers and doorbells are single MMIO writes, modern NVMe can sustain millions of IOPS with minimal CPU involvement per request. This is the hardware shape that io_uring (concept 14) mirrors in software.
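
Little's law gives a rough sense of why deep queues are needed to reach those rates: sustained IOPS is approximately the number of requests in flight divided by per-request latency. The 100-microsecond latency below is an assumed figure, and both helper names are hypothetical.

```python
import math


def iops(queue_depth, latency_us):
    """Little's law: throughput = requests in flight / latency."""
    return queue_depth * 1_000_000 / latency_us


def depth_needed(target_iops, latency_us):
    """Outstanding requests required to sustain target_iops."""
    return math.ceil(target_iops * latency_us / 1_000_000)


print(iops(queue_depth=1, latency_us=100))        # one request at a time
print(depth_needed(1_000_000, latency_us=100))    # requests in flight for 1M IOPS
```

With that assumed latency, a single outstanding request tops out at 10,000 IOPS, and reaching one million IOPS takes about 100 requests in flight, which is why submissions are spread over many queues and cores.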

A simple IDE disk driver (OSTEP's case study) is much simpler: wait-for-ready, write command + LBA to ports, wait for interrupt, read or write data via PIO or simple DMA.

Common Confusion / Misconception

"DMA is zero-copy." DMA moves bytes without CPU cycles, but user-space read still copies from the page cache to the user buffer. True zero-copy requires sendfile, splice, vmsplice, or io_uring's registered buffers, and even those have trade-offs.

"Interrupts are always preferable to polling." At very high rates, polling (NAPI, busy-poll, DPDK) beats interrupts because the overhead per wakeup exceeds the time between events. Network drivers and high-end storage drivers switch modes dynamically.

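The interrupt-then-poll switching described above can be sketched as a toy model in the style of NAPI: the first packet raises an interrupt, the handler masks further interrupts and schedules polling, and interrupts are re-enabled only when a poll finds less than a full budget of work. ToyNapiNic and its fields are hypothetical; the real kernel API differs in detail.

```python
from collections import deque


class ToyNapiNic:
    def __init__(self, budget=64):
        self.rx = deque()          # receive ring filled by the device
        self.budget = budget       # max packets handled per poll call
        self.irq_enabled = True
        self.irqs_taken = 0
        self.polls = 0
        self.delivered = 0

    def packet_arrives(self, n=1):
        self.rx.extend(range(n))
        if self.irq_enabled:
            # Interrupt handler: count it, mask further interrupts, switch to polling.
            self.irqs_taken += 1
            self.irq_enabled = False

    def poll(self):
        # Process up to `budget` packets without taking any interrupts.
        self.polls += 1
        work = 0
        while self.rx and work < self.budget:
            self.rx.popleft()
            work += 1
        self.delivered += work
        if work < self.budget:
            # Ring drained: go back to interrupt mode.
            self.irq_enabled = True
        return work


nic = ToyNapiNic(budget=64)
nic.packet_arrives(1)      # raises the only interrupt
nic.packet_arrives(199)    # arrives while interrupts are masked
while not nic.irq_enabled:
    nic.poll()
print(nic.irqs_taken, nic.polls, nic.delivered)
```

In this run 200 packets cost one interrupt and four polls; at low rates the same code degenerates to one interrupt per packet, which is the cheap case.
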
"The kernel has direct access to devices." On modern systems with virtualization and IOMMU, devices can only touch memory the IOMMU allows. This adds safety and a small cost.

How To Use It

You will rarely write a device driver, but understanding this layer helps with:

  1. Debugging unexpected latency spikes: interrupt coalescing, queue depth, IOMMU faults.
  2. Tuning servers: IRQ affinity, NUMA locality, RPS/RFS on NICs.
  3. Understanding why bypass frameworks (DPDK for NICs, SPDK for NVMe) push the driver into user space to eliminate syscall and interrupt cost entirely.
  4. Reading dmesg and perf output: "nvme0n1: IO timeout, aborting" means something specific, and this concept is the vocabulary.

Check Yourself

  1. Why is DMA necessary for high-throughput I/O? What would PIO look like at 10 GiB/s?
  2. Why does NVMe use multiple submission/completion queues? How does that interact with multi-core CPUs?
  3. Why do high-PPS network drivers sometimes poll ("NAPI") instead of taking every interrupt?

Mini Drill or Application

On a running Linux system:

  1. Run cat /proc/interrupts and identify your disk and network IRQ rates.
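  2. Run iostat -x 1 while doing a heavy I/O workload; correlate %util, await (reported as r_await and w_await by current sysstat), and the average queue size. Older versions also print svctm, which is deprecated and should not be trusted.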
  3. Read one small driver's source, e.g. a simple virtio-blk or loopback driver, and identify: probe, submit, completion, interrupt handler.
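
For step 1, /proc/interrupts shows cumulative counts, so a rate needs two samples. The following is a minimal sketch of one way to do that; the function names are hypothetical and the parser assumes the usual layout (a header row of CPU names, then one row per IRQ with per-CPU counts followed by a description).

```python
import time


def parse_interrupts(text):
    """Return {irq_label: (total_count_across_cpus, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())            # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        tokens = rest.split()
        counts = []
        for tok in tokens[:ncpu]:           # some rows (e.g. ERR) have fewer columns
            if not tok.isdigit():
                break
            counts.append(int(tok))
        desc = " ".join(tokens[len(counts):])
        totals[label.strip()] = (sum(counts), desc)
    return totals


def irq_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both samples."""
    return {irq: (after[irq][0] - before[irq][0]) / seconds
            for irq in after if irq in before}


def sample(path="/proc/interrupts", interval=1.0):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_interrupts(f.read())
    return after, irq_rates(before, after, interval)
```

Calling sample() on a Linux machine and sorting the returned rates in descending order should put the busiest NVMe queue and NIC interrupts near the top; the descriptions in the first return value tell you which device each IRQ belongs to.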

Read This Only If Stuck