Why Databases Care About Disk: Random vs Sequential I/O
What This Concept Is
A database does not see bytes. It sees pages moving through a memory hierarchy. At each level of that hierarchy, two reads that return the same bytes can cost wildly different amounts depending on where the bytes are and how you asked for them.
The useful mental model has three numbers that matter in orders of magnitude:
- CPU register / L1: roughly
1 ns - main memory (RAM): roughly
100 ns - SSD random read: roughly
100 us(about1000xRAM) - HDD random seek: roughly
10 ms(about100,000xRAM)
"Random I/O" means each request pays the full seek/positioning cost. "Sequential I/O" means the device is already positioned where you want, so you mostly pay transfer time. On an HDD the ratio can be 100x; on an SSD it is smaller but still real because of internal block erase, write amplification, and queue depth.
A page is the unit of transfer. Typical sizes are 4 KB, 8 KB, or 16 KB. When the engine says "one I/O," it almost always means "one page."
Why It Matters Here
Every later concept in this module is shaped by this asymmetry.
- B+-trees are wide and shallow so that a point lookup is a few random reads, not thousands.
- LSM-trees batch writes into sorted runs because sequential writes are cheap and random in-place writes are expensive.
- The buffer pool exists so the same page is not re-read from disk on every query.
- Query plans prefer index-only scans when they can avoid heap fetches, because each heap fetch is a potential random I/O.
If you think of disk as "slower RAM," you will misread every design decision that follows.
Concrete Example
Suppose a table has 10^7 rows of 100 bytes each, so roughly 1 GB. You want all rows where user_id = 42, and there are 5 such rows.
- Full table scan: the engine reads every
8 KBpage sequentially. That is~125,000pages. On SSD at500 MB/ssequential, about2 s. This is sequential I/O. - B+-tree index lookup: the engine descends
~3levels of the index (say3page reads), then fetches5heap pages, one per matching row. That is8pages, but each heap fetch is a random read. On SSD at100 usper random read, about0.8 ms.
2 s vs 0.8 ms: a factor of 2500. The index did not "make disk faster." It changed the access pattern.
Now make user_id = 42 match 1,000,000 rows. Suddenly the index does 3 + 1,000,000 random reads, possibly ~100 s. The optimizer should pick the full scan and finish in 2 s. The correct access path depends on selectivity.
Common Confusion / Misconception
"SSDs make this irrelevant." They soften the gap, they do not erase it. SSDs still have:
- page/block boundaries and internal remapping (the FTL)
- garbage collection that can stall writes
- queue-depth effects where sequential large I/O saturates more efficiently
"One read is one read." On Linux, one logical read may translate to zero I/Os (already in page cache), one I/O (cold page), or many I/Os (misaligned access crossing block boundaries). The engine's I/O cost counter and the kernel's view are not the same thing.
How To Use It
Whenever you are analyzing a plan or storage choice:
- Count pages, not rows.
- Classify each access as random or sequential.
- Multiply by the right per-I/O cost for the device.
- Ask whether the pattern could be made sequential (sorting, batching, clustering, compacting).
- Ask what is in the buffer pool, because any page already there costs nothing.
This is the cost model the rest of the module reuses.
Check Yourself
- Why does sorting writes into runs change the physics of a hard disk, and how does this carry over to SSDs?
- If one page is
8 KB, how many pages do you expect for a1 GBtable? - Why can an index scan be slower than a full scan for low-selectivity predicates?
Mini Drill or Application
For each scenario, state which I/O pattern dominates and roughly how many page reads you expect:
SELECT COUNT(*) FROM orderson a50 GBtable with no index on the filter.SELECT * FROM users WHERE id = 123with a primary-key B+-tree index.UPDATE items SET price = price * 1.1on a10 GBtable.SELECT * FROM events WHERE ts BETWEEN '...' AND '...'with a clustered index ontsmatching0.1%of rows.- Appending
1,000,000rows to a log-structured engine vs a B+-tree-indexed table.