Skip to main content

Why Databases Care About Disk: Random vs Sequential I/O

What This Concept Is

A database does not see bytes. It sees pages moving through a memory hierarchy. At each level of that hierarchy, two reads that return the same bytes can cost wildly different amounts depending on where the bytes are and how you asked for them.

The useful mental model has three numbers that matter in orders of magnitude:

  • CPU register / L1: roughly 1 ns
  • main memory (RAM): roughly 100 ns
  • SSD random read: roughly 100 us (about 1000x RAM)
  • HDD random seek: roughly 10 ms (about 100,000x RAM)

"Random I/O" means each request pays the full seek/positioning cost. "Sequential I/O" means the device is already positioned where you want, so you mostly pay transfer time. On an HDD the ratio can be 100x; on an SSD it is smaller but still real because of internal block erase, write amplification, and queue depth.

A page is the unit of transfer. Typical sizes are 4 KB, 8 KB, or 16 KB. When the engine says "one I/O," it almost always means "one page."

Why It Matters Here

Every later concept in this module is shaped by this asymmetry.

  • B+-trees are wide and shallow so that a point lookup is a few random reads, not thousands.
  • LSM-trees batch writes into sorted runs because sequential writes are cheap and random in-place writes are expensive.
  • The buffer pool exists so the same page is not re-read from disk on every query.
  • Query plans prefer index-only scans when they can avoid heap fetches, because each heap fetch is a potential random I/O.

If you think of disk as "slower RAM," you will misread every design decision that follows.

Concrete Example

Suppose a table has 10^7 rows of 100 bytes each, so roughly 1 GB. You want all rows where user_id = 42, and there are 5 such rows.

  • Full table scan: the engine reads every 8 KB page sequentially. That is ~125,000 pages. On SSD at 500 MB/s sequential, about 2 s. This is sequential I/O.
  • B+-tree index lookup: the engine descends ~3 levels of the index (say 3 page reads), then fetches 5 heap pages, one per matching row. That is 8 pages, but each heap fetch is a random read. On SSD at 100 us per random read, about 0.8 ms.

2 s vs 0.8 ms: a factor of 2500. The index did not "make disk faster." It changed the access pattern.

Now make user_id = 42 match 1,000,000 rows. Suddenly the index does 3 + 1,000,000 random reads, possibly ~100 s. The optimizer should pick the full scan and finish in 2 s. The correct access path depends on selectivity.

Common Confusion / Misconception

"SSDs make this irrelevant." They soften the gap, they do not erase it. SSDs still have:

  • page/block boundaries and internal remapping (the FTL)
  • garbage collection that can stall writes
  • queue-depth effects where sequential large I/O saturates more efficiently

"One read is one read." On Linux, one logical read may translate to zero I/Os (already in page cache), one I/O (cold page), or many I/Os (misaligned access crossing block boundaries). The engine's I/O cost counter and the kernel's view are not the same thing.

How To Use It

Whenever you are analyzing a plan or storage choice:

  1. Count pages, not rows.
  2. Classify each access as random or sequential.
  3. Multiply by the right per-I/O cost for the device.
  4. Ask whether the pattern could be made sequential (sorting, batching, clustering, compacting).
  5. Ask what is in the buffer pool, because any page already there costs nothing.

This is the cost model the rest of the module reuses.

Check Yourself

  1. Why does sorting writes into runs change the physics of a hard disk, and how does this carry over to SSDs?
  2. If one page is 8 KB, how many pages do you expect for a 1 GB table?
  3. Why can an index scan be slower than a full scan for low-selectivity predicates?

Mini Drill or Application

For each scenario, state which I/O pattern dominates and roughly how many page reads you expect:

  1. SELECT COUNT(*) FROM orders on a 50 GB table with no index on the filter.
  2. SELECT * FROM users WHERE id = 123 with a primary-key B+-tree index.
  3. UPDATE items SET price = price * 1.1 on a 10 GB table.
  4. SELECT * FROM events WHERE ts BETWEEN '...' AND '...' with a clustered index on ts matching 0.1% of rows.
  5. Appending 1,000,000 rows to a log-structured engine vs a B+-tree-indexed table.

Read This Only If Stuck