
Page Cache and Buffer Cache

What This Concept Is

The kernel interposes a memory cache between user I/O and the disk. Historically Linux had two caches:

  • Buffer cache: keyed by (device, block_number), used by the block layer, file systems (bitmaps, inodes), and tools that see the raw device (dd if=/dev/sda).
  • Page cache: keyed by (inode, offset) in page-sized (typically 4 KiB) chunks, used by regular file I/O and mmap.

Since kernel 2.4, Linux has unified them: the page cache is the primary cache, and the "buffer cache" is a set of buffer_head structures layered over page-cache pages for block-device access. On modern Linux, "buffer cache" usually means "the page cache accessed by block number."

  user space:   read(fd, buf, n)          mmap -> load through pointer
                      |                          |
                      v                          v
      +-------------------- page cache --------------------+   <-- keyed by (inode, offset)
                                |
                                |  dirty page writeback / read miss
                                v
                  +-------- block layer --------+
                                |
                                v
                           disk/device
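
Both entry points in the diagram reach the same cached pages. The following minimal Python sketch illustrates this under the assumption of a Linux system with a regular temporary directory: it modifies a file through a shared mapping and reads the change back through pread without any explicit flush. The function name shared_view_demo is hypothetical and chosen only for this example.

```python
import mmap
import os
import tempfile


def shared_view_demo():
    """Modify a file through mmap, then read the change back with pread."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello page cache")
        # MAP_SHARED maps the file's page-cache pages into this process.
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED) as view:
            view[0:5] = b"HELLO"  # a plain memory store, no write() call
            # No msync or flush here: pread looks at the same cached page.
            return os.pread(fd, 16, 0)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_view_demo())
```

The store through the mapping dirties the page, and writeback later carries it to disk just as it would for a write call.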

A read either:

  • hits the page cache: copy from the kernel page to the user buffer, on the order of 100 ns to 1 µs for a small read, no I/O
  • misses: allocate a page, submit block I/O, wait for completion, then copy

A write typically:

  • copies the data into the target page, marks the page dirty, and returns without waiting for the disk
  • later, kernel writeback threads (pdflush originally, then per-device flusher threads, now kworker workqueue threads) flush dirty pages to disk based on page age and dirty-ratio thresholds
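
The gap between "write returned" and "data is on disk" can be made visible by timing the two steps separately. This is a minimal sketch, not a benchmark: the function name timed_write_and_fsync and the 8 MiB default are assumptions for illustration, and on a RAM-backed temporary directory (tmpfs) the fsync step will be close to free.

```python
import os
import tempfile
import time


def timed_write_and_fsync(size=8 * 1024 * 1024):
    """Return (seconds spent in write, seconds spent in fsync)."""
    payload = memoryview(b"\0" * size)
    fd, path = tempfile.mkstemp()
    try:
        t0 = time.perf_counter()
        written = 0
        while written < size:
            # write() normally only copies into page-cache pages.
            written += os.write(fd, payload[written:])
        t1 = time.perf_counter()
        os.fsync(fd)  # blocks until the dirty pages have been written back
        t2 = time.perf_counter()
        return t1 - t0, t2 - t1
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    w, s = timed_write_and_fsync()
    print(f"write: {w * 1e3:.2f} ms, fsync: {s * 1e3:.2f} ms")
```

On a disk-backed file system the fsync time usually dominates, which is the cost that write alone hides.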

Why It Matters Here

Almost all "I/O performance" wins and losses happen in this cache:

  • Hot file reads return from memory at DRAM speed, not disk speed. This is why a sequential read of a 100 MiB file can run at several GiB/s the second time.
  • Writeback is asynchronous: a successful write does not mean the data is on disk. This semantic gap is exactly what Cluster 4 concept 11 (fsync) addresses.
  • The cache holds dirty pages that the FS has not yet persisted. These are lost on crash unless fsync or journal commit has forced them.

The page cache is also integrated with virtual memory: page cache pages are normal memory pages. The kernel can reclaim them under memory pressure like any other page; clean pages are simply dropped, while dirty pages must be written back first.

Concrete Example

  strace -c -w cat /boot/vmlinuz > /dev/null    # cold; -w reports wall-clock time
  # read: 8 calls, ~80 ms total
  strace -c -w cat /boot/vmlinuz > /dev/null    # warm
  # read: 8 calls, ~2 ms total

The cold run hits disk; the warm run returns from the page cache. You can demonstrate this explicitly:

  sync                                 # write back dirty pages first
  echo 3 > /proc/sys/vm/drop_caches    # as root: drop page cache plus dentries and inodes
  time cat big.file > /dev/null        # observe cold time
  time cat big.file > /dev/null        # observe warm time
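
Writing to drop_caches needs root and affects the whole machine. A narrower experiment, sketched below in Python, asks the kernel to drop only one file's pages with posix_fadvise and POSIX_FADV_DONTNEED. Treat this as an assumption-laden sketch: the call is only a hint, it has no effect on RAM-backed file systems, and the names read_all and cold_then_warm are hypothetical.

```python
import os
import tempfile
import time


def read_all(path, chunk=1 << 20):
    """Read the whole file with plain read() calls; return the byte count."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            block = f.read(chunk)
            if not block:
                return total
            total += len(block)


def cold_then_warm(size=64 << 20):
    """Return (first read seconds, second read seconds, bytes per read)."""
    fd, path = tempfile.mkstemp()
    try:
        view = memoryview(b"x" * size)
        written = 0
        while written < size:
            written += os.write(fd, view[written:])
        os.fsync(fd)  # pages must be clean before they can be dropped
        if hasattr(os, "posix_fadvise"):
            # Hint only: ask the kernel to evict this file's cached pages.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        t0 = time.perf_counter()
        n = read_all(path)  # expected to miss if the hint was honored
        t1 = time.perf_counter()
        read_all(path)      # expected to hit the page cache
        t2 = time.perf_counter()
        return t1 - t0, t2 - t1, n
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    cold, warm, n = cold_then_warm()
    print(f"{n} bytes: first {cold * 1e3:.1f} ms, second {warm * 1e3:.1f} ms")
```

If both timings come out similar, the temporary directory is probably RAM-backed or the hint was ignored; pointing the sketch at a disk-backed path should show the difference.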

free -h shows memory usage with a distinct "buff/cache" column. This is not overhead; Linux uses free memory as cache aggressively. "Free" memory you are not using is wasted memory, hence the Linux dictum: "unused RAM is useless RAM."
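
The same numbers can be read directly from /proc/meminfo. The sketch below parses that file's "Name: value kB" lines and sums Buffers, Cached, and SReclaimable, which is roughly how the buff/cache column is derived as far as I know; treat that formula, and the function names parse_meminfo and cache_summary, as assumptions of this example.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value.

    Most values are in KiB; a few fields (such as HugePages_Total)
    are plain counts and are stored as-is.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cache_summary(fields):
    """Summarize free, cache, and available memory in KiB."""
    cache = (fields.get("Buffers", 0)
             + fields.get("Cached", 0)
             + fields.get("SReclaimable", 0))  # assumed buff/cache formula
    return {
        "free_kib": fields.get("MemFree", 0),
        "cache_kib": cache,
        "available_kib": fields.get("MemAvailable", 0),
    }
```

On a Linux machine, calling cache_summary(parse_meminfo(open("/proc/meminfo").read())) typically shows available_kib well above free_kib, the difference being mostly reclaimable cache.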

Common Confusion / Misconception

"The cache is separate from free memory." On Linux, most of the page cache counts toward available memory (the "available" column of free, MemAvailable in /proc/meminfo) because clean pages can be evicted under allocation pressure. That is why free reports "used" (applications) and "buff/cache" (mostly reclaimable) separately.

"A read always hits disk." Only the first read of a given page does, and readahead often fetches neighboring pages at the same time. Subsequent reads of the same region are served from cache until the page is evicted. This is why microbenchmarks that do not drop caches massively overestimate real-world throughput.

"O_DIRECT skips the page cache, so it is faster." It is usually slower for general workloads because it bypasses kernel readahead and write-back buffering, so every request waits for the device. O_DIRECT is for applications (such as databases) that maintain their own cache and want to avoid double-buffering.

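To show what an application takes on when it opts out of the page cache, here is a hedged Python sketch of a single O_DIRECT read. It assumes Linux, a file system that supports O_DIRECT, and a 4096-byte alignment requirement; the real requirement depends on the device and file system. The helper names read_first_block_direct and direct_read_demo are hypothetical. The helper returns None whenever direct I/O is unavailable.

```python
import mmap
import os
import tempfile


def read_first_block_direct(path, block_size=4096):
    """Read the first block of a file with O_DIRECT, or return None."""
    o_direct = getattr(os, "O_DIRECT", None)  # not defined on every platform
    if o_direct is None:
        return None
    try:
        fd = os.open(path, os.O_RDONLY | o_direct)
    except OSError:
        return None  # file system refused O_DIRECT
    try:
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, block_size)
        try:
            n = os.readv(fd, [buf])  # data moves from device to buf directly
            return bytes(buf[:n])
        finally:
            buf.close()
    except OSError:
        return None  # for example, an alignment rule was not met
    finally:
        os.close(fd)


def direct_read_demo():
    """Return (bytes read directly or None, expected bytes)."""
    data = bytes(range(256)) * 32  # 8192 bytes
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.fsync(fd)
        return read_first_block_direct(path), data[:4096]
    finally:
        os.close(fd)
        os.unlink(path)
```

Note what is missing compared with a buffered read: no readahead, no reuse on the next call, and strict alignment rules for the buffer, offset, and length. A database accepts those costs because it keeps its own cache.
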
How To Use It

When analyzing performance, always ask:

  1. Is the working set in the page cache? (Compare its size to total RAM minus what applications and the kernel reserve.)
  2. Did write return because the data is persisted or because it landed in cache? (Answer: the cache, until you fsync.)
  3. Is a cache miss due to cold cache or cache eviction? (vmstat, sar, and pressure counters help.)
  4. Is the cost of a "cached" workload actually lock contention on cache data structures rather than I/O? (On many-core systems, yes sometimes.)

Check Yourself

  1. Why does a fresh boot perform worse on a disk-heavy workload than a long-running system?
  2. When does O_DIRECT help? When does it hurt?
  3. Why does free show most RAM as "used" on a system that is running lightly?

Mini Drill or Application

Use vmstat 1, iostat -x 1, and cat /proc/meminfo:

  1. Read a 1 GiB file after drop_caches. Observe bi (block in) activity.
  2. Read it again immediately. Observe zero bi activity.
  3. Run a memory-hungry program that allocates 8 GiB. Observe page cache shrinkage and possibly the file re-reading from disk next time.
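
If you prefer to script the drill, the sketch below samples the pgpgin counter in /proc/vmstat before and after an action. I believe this is the counter behind the bi column in vmstat, but treat that mapping as an assumption. The counter is system-wide, so other activity on the machine also shows up. The names parse_vmstat and pgpgin_during are hypothetical.

```python
def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def _read_proc_vmstat():
    with open("/proc/vmstat") as f:
        return f.read()


def pgpgin_during(action, read_text=_read_proc_vmstat):
    """Run action() and return how much the pgpgin counter grew."""
    before = parse_vmstat(read_text())
    action()
    after = parse_vmstat(read_text())
    return after.get("pgpgin", 0) - before.get("pgpgin", 0)
```

Passing a function that reads the 1 GiB file as action should give a large delta on the cold read and a delta near zero on the warm one.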

Read This Only If Stuck