Skip to main content

The Buffer Pool and Page Replacement

What This Concept Is

The buffer pool is a fixed-size array of frames in main memory. Each frame can hold one page. When the engine needs page P, it does one of three things:

  1. Hit: P is already in some frame. Return a pointer. Cost: near zero.
  2. Miss, free frame available: read P from disk into a free frame. One I/O.
  3. Miss, no free frames: pick a victim frame to evict, possibly write it back if dirty, then load P. One or two I/Os.

Every frame carries metadata: page_id, pin_count (how many threads currently need it), dirty (has it been modified since being read), and a reference bit or access counter used by the replacement policy.

Rules that are not negotiable:

  • A pinned page cannot be evicted.
  • A dirty page must be written back before eviction (otherwise the modification is lost).
  • The page table maps page_id -> frame and is consulted on every logical read.

Replacement policies you should know:

  • LRU: evict the least recently used. Simple, but one sequential scan of a large table can flush everything useful out of the pool.
  • Clock: approximate LRU using a circular scan and a reference bit. Cheaper, widely used.
  • LRU-K: track the last K access times and evict based on the K-th most recent. Resists the scan-flood problem because a page that has only been touched once is a worse citizen than one touched K times.

Why It Matters Here

Every later cost claim in the module ("a lookup is 3 I/Os") assumes the buffer pool is doing its job. Specifically:

  • the top levels of a B+-tree root and internal nodes are almost always in the pool; only the leaf may miss
  • an LSM read against a memtable never touches disk; an LSM read against an old SSTable likely will
  • a bad replacement policy under a scan-heavy workload can invalidate all of the above

If you ignore the buffer pool, your I/O estimates will be off by orders of magnitude in either direction.

Concrete Example

A pool of 4 frames, using LRU, receives this access sequence: A, B, C, D, A, E, A, B, F.

step  access  frames (LRU order, oldest -> newest)   action
1 A [A] miss, load A
2 B [A,B] miss
3 C [A,B,C] miss
4 D [A,B,C,D] miss, pool full
5 A [B,C,D,A] hit
6 E [C,D,A,E] miss, evict B
7 A [C,D,E,A] hit
8 B [D,E,A,B] miss, evict C
9 F [E,A,B,F] miss, evict D

Hit rate: 2 / 9. Now suppose step 6 had been a full-table scan that touched 1000 unique pages. Under LRU, everything useful would be evicted; under LRU-K with K=2, the scan pages would each have only one access, so A (touched twice) would stay resident.

Common Confusion / Misconception

"Buffer pool hits are free." They are cheap, but not free: each logical read still does a hash lookup in the page table, increments a pin counter, and walks replacement-policy metadata. On very hot paths this shows up in profiles.

"Eviction only happens under memory pressure." The writer thread may also flush dirty pages proactively (background flush) to keep the pool clean enough to evict quickly. That is why checkpoint and bgwriter appear in performance discussions.

How To Use It

When a query is slow and you suspect storage:

  1. Get the cache hit ratio. If it is already ~99%, the bottleneck is probably not I/O.
  2. If it is low, ask what is polluting the pool (large scans, cold partitions, too many indexes).
  3. Check dirty-page behavior: if the writer is always busy, writes may be bottlenecked at eviction, not at disk.
  4. Only then think about adding memory or changing the replacement policy.

This is also the lens for reading the LSM cluster later: LSM engines effectively run two buffer tiers -- the memtable (their own in-RAM store) and the OS page cache.

Check Yourself

  1. Why is a pinned page exempt from eviction, and what must the engine guarantee about pinning?
  2. How does LRU-K resist a single large sequential scan better than plain LRU?
  3. Why does an engine sometimes write a dirty page even when there is no memory pressure?

Mini Drill or Application

For a pool of size 3 and the access sequence 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5, compute the number of misses under:

  1. FIFO
  2. LRU
  3. Clock (reference bit reset on a full rotation)

Then explain in one sentence each why the results differ.

Read This Only If Stuck