Read-Ahead, Write-Back, fsync Semantics

What This Concept Is

Three mechanisms connect the page cache to actual disk I/O:

  • Read-ahead (prefetch): on a sequential read pattern, the kernel reads more than you asked for into the cache so the next read hits. Triggered by access-pattern heuristics.
  • Write-back (dirty flush): writes return as soon as data is in the cache. Kernel flusher threads write dirty pages later, driven by the dirty-ratio thresholds (vm.dirty_background_ratio / vm.dirty_ratio), the dirty-expire time (vm.dirty_expire_centisecs), and memory pressure.
  • fsync(fd): forces all dirty data and metadata for the file backing fd to stable storage before returning. On Linux, mainstream filesystems also send a cache-flush request to the drive (REQ_PREFLUSH, formerly REQ_FLUSH, and/or FUA writes), so the result does not rely on the drive's volatile cache.

    write(fd, buf, N) --> dirty page in cache --> RETURNS immediately
                                  |
                                  |  (time passes: dirty_expire, memory pressure)
                                  v
    kernel flusher thread --> block layer --> drive
                                                |
                                                v
                                       volatile drive cache
                                                |
                                                |  (FLUSH cmd)
                                                v
                                          stable storage
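
To see the thresholds that drive write-back on a given machine, the sysctls can be read straight from /proc/sys/vm. The following is a minimal sketch, assuming a Linux system; the helper name read_writeback_knobs and the base parameter are illustrative, not part of any library.

```python
from pathlib import Path

# Linux sysctls that control when dirty pages get written back.
WRITEBACK_KNOBS = (
    "dirty_background_ratio",     # % of memory dirty before background flushing starts
    "dirty_ratio",                # % of memory dirty before writers are throttled
    "dirty_expire_centisecs",     # age at which a dirty page becomes eligible for flushing
    "dirty_writeback_centisecs",  # how often the flusher threads wake up
)

def read_writeback_knobs(base="/proc/sys/vm"):
    """Return {knob: int or None}; None means the file was missing or unreadable."""
    knobs = {}
    for name in WRITEBACK_KNOBS:
        try:
            knobs[name] = int((Path(base) / name).read_text())
        except (OSError, ValueError):
            knobs[name] = None
    return knobs

if __name__ == "__main__":
    for name, value in read_writeback_knobs().items():
        print(f"vm.{name} = {value}")
```

The values differ between distributions and kernel versions, so treat whatever this prints as a description of one machine rather than a general default.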

Three related calls with different scope:

  • fdatasync(fd): flushes data plus only the metadata needed to read that data back, such as the file size; it may skip timestamp-only updates like mtime. Cheaper than fsync.
  • sync_file_range(fd, ...): flushes a range; does not issue a drive-cache flush, so it does not guarantee durability. Used inside databases that manage their own barriers.
  • sync() / syncfs(fd): flushes the entire FS (or the FS containing fd).
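
The difference between fsync and fdatasync can be sketched with Python's os module, which exposes both calls directly. This is an illustration only: append_record is a hypothetical helper, os.fdatasync is assumed to be available (it is on Linux), and short writes are ignored for brevity.

```python
import os

def append_record(fd, record, data_only=False):
    """Append one record and make it durable before returning.

    data_only=True uses fdatasync: the data and the metadata needed to read
    it back (an append changes the file size, so the size is included) are
    flushed, but timestamp-only updates may be skipped.
    """
    os.write(fd, record)      # lands in the page cache; not yet durable
    if data_only:
        os.fdatasync(fd)
    else:
        os.fsync(fd)          # data plus all metadata for the file
```

A caller would open the file with os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND) and pass the resulting descriptor in. Python does not wrap sync_file_range or syncfs in the os module; os.sync() corresponds to sync().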

Why It Matters Here

Durability requires fsync (or fdatasync). When write returns, the data is only guaranteed to be in volatile memory: the page cache at first, and possibly still the drive's volatile DRAM cache even after the kernel has sent it to the device. Only an fsync that completes successfully, on a drive that honors cache-flush commands, implies "safe against power loss."

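Read-ahead is usually invisible but can hurt if the kernel guesses wrong. posix_fadvise is advisory: posix_fadvise(fd, ..., POSIX_FADV_RANDOM) asks the kernel to disable read-ahead, and POSIX_FADV_SEQUENTIAL asks for more aggressive prefetch. Databases often sidestep read-ahead entirely by using O_DIRECT, which bypasses the page cache, and running their own prefetcher.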
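
A minimal sketch of passing an access-pattern hint, assuming Linux (os.posix_fadvise is not available on every platform). The helper names open_with_hint and read_all are made up for this example.

```python
import os

def open_with_hint(path, sequential):
    """Open path read-only and hint the expected access pattern to the kernel."""
    fd = os.open(path, os.O_RDONLY)
    advice = os.POSIX_FADV_SEQUENTIAL if sequential else os.POSIX_FADV_RANDOM
    # offset=0, length=0 applies the advice to the whole file.
    os.posix_fadvise(fd, 0, 0, advice)
    return fd

def read_all(fd, chunk_size=1 << 20):
    """Read fd to EOF in chunk_size pieces and return the number of bytes read."""
    total = 0
    while True:
        chunk = os.read(fd, chunk_size)
        if not chunk:
            return total
        total += len(chunk)
```

The hint changes only how the kernel prefetches, not what is returned, so the way to judge it is to time a cold-cache scan with each setting.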

Write-back lets the FS coalesce and reorder writes, which is great for throughput but dangerous for correctness. Journaling (Cluster 3) and proper fsync discipline close the loop.

Concrete Example

Measuring fsync cost (approximately, on a consumer NVMe):

write 4 KiB to file:             ~5 us              (page cache)
fsync after a 4 KiB write:       ~200 us to ~2 ms   (flush + drive barrier)
write 4 KiB + fsync, repeated:   ~500 to 5,000 ops/s

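This is why naive "durable per insert" database patterns cap at a few thousand writes per second: at 200 us to 2 ms per fsync, one fsync per insert allows at most 500 to 5,000 inserts per second. The fixes you will meet: group commit (batch many writes per fsync), a write-ahead log or WAL (many logical writes appended to one log, made durable with one fsync per batch), and direct I/O with batched flushes.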
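
The effect of batching can be measured with a few lines of Python. This is a rough sketch, not a rigorous benchmark: timed_appends is a hypothetical helper, short writes are ignored, and the numbers depend heavily on the drive, filesystem, and mount options.

```python
import os
import time

def timed_appends(path, n_records, record_size=4096, fsync_every=0):
    """Append n_records and return records per second.

    fsync_every=0 never calls fsync, 1 syncs after every record,
    and k > 1 syncs once per k records (a crude group commit).
    """
    buf = b"x" * record_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        start = time.perf_counter()
        for i in range(1, n_records + 1):
            os.write(fd, buf)
            if fsync_every and i % fsync_every == 0:
                os.fsync(fd)
        # Make the final partial batch durable as well.
        if fsync_every and n_records % fsync_every != 0:
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return n_records / max(elapsed, 1e-9)

if __name__ == "__main__":
    for every in (0, 1, 100):
        rate = timed_appends(f"bench_{every}.dat", 2000, fsync_every=every)
        print(f"fsync_every={every}: {rate:,.0f} records/s")
```

Run it on the filesystem you care about; tmpfs and some virtualized disks make fsync nearly free and hide the effect.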

Read-ahead in action: on a cold cache, cat bigfile starts with small block reads (maybe 16-32 KiB), then the kernel detects the sequential pattern and grows the window to 128-256 KiB, until large streaming reads saturate the device. On a random-read workload, the window stays small.
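
The maximum read-ahead window is a per-device setting that Linux exposes under /sys/block. A small sketch for listing it, assuming the usual sysfs layout; read_ahead_settings and its sys_block parameter are illustrative names.

```python
from pathlib import Path

def read_ahead_settings(sys_block="/sys/block"):
    """Return {device: maximum read-ahead window in KiB} for each block device."""
    settings = {}
    for knob in sorted(Path(sys_block).glob("*/queue/read_ahead_kb")):
        device = knob.parent.parent.name   # <sys_block>/<device>/queue/read_ahead_kb
        try:
            settings[device] = int(knob.read_text())
        except (OSError, ValueError):
            continue                        # skip devices we cannot read
    return settings

if __name__ == "__main__":
    for device, kib in read_ahead_settings().items():
        print(f"{device}: {kib} KiB")
```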

Common Confusion / Misconception

"fsync returns means data is on the platter." Only if it returns 0, and only with hardware and drivers that honor cache flushes. Always check the return value: on Linux, after a failed fsync the affected dirty pages may be marked clean, so a retry can report success without the data ever reaching disk. On older XFS and some stack combinations there are also historical bugs around fsync failure reporting. Read the fsync man page and the "fsyncgate" LWN articles.

"sync_file_range is a faster fsync." No. sync_file_range does not flush the drive cache. It is a tool for initiating writeback, not for guaranteeing durability.

"close() implies fsync." No. close(fd) releases the descriptor and returns without waiting for writeback. A crash after close but before the flusher thread has written the dirty pages can still lose the data.

How To Use It

For any durability claim in your code, ask: did I fsync the file containing the data, and (for new or renamed files) did I fsync the parent directory too? The directory fsync is required because rename and creat modify the parent directory's entries, not the file itself. Without an fsync on the directory, the new name may not survive a crash even though the file's data is intact.

Pattern for "durable new file":

  1. open with O_CREAT.
  2. write all data.
  3. fsync the file.
  4. close the file.
  5. open the parent directory.
  6. fsync the directory.
  7. close the directory.

Or better: write to foo.tmp, fsync it, rename it to foo, then fsync the directory. The rename is atomic, so readers see either the old content or the new content, never a partially written file.
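
The write-to-temp, fsync, rename, fsync-directory sequence can be sketched as follows. It assumes a POSIX system where a directory can be opened and fsynced (true on Linux); durable_replace and fsync_dir are hypothetical helper names, and the fixed ".tmp" suffix assumes a single writer.

```python
import os

def fsync_dir(dirpath):
    """fsync a directory so that entries created or renamed in it are durable."""
    fd = os.open(dirpath, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def durable_replace(path, data):
    """Atomically replace path with data, durable once this function returns."""
    dirpath = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # loop in case of short writes
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # file contents reach stable storage
    finally:
        os.close(fd)
    os.replace(tmp, path)                 # atomic rename over the old file
    fsync_dir(dirpath)                    # make the rename itself durable
```

Calling durable_replace("config.json", b"{}") leaves either the old config.json or the complete new one after a crash. Error handling is deliberately minimal: if fsync raises, the temporary file is left behind and the original is untouched.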

Check Yourself

  1. If you call write(fd, buf, 4096) ten times and then the machine loses power, how much of the 40 KiB is guaranteed to survive?
  2. Why do databases sometimes write their own O_DIRECT writer with explicit FLUSH commands instead of relying on fsync?
  3. What does fdatasync skip that fsync includes, and why is that usually safe?

Mini Drill or Application

Write a small program that:

  1. Appends 1 million 512-byte records.
  2. Can be run four ways: (a) no fsync, (b) fsync after every record, (c) fsync every 1,000 records, (d) no fsync, but with the file opened O_DSYNC.
  3. Reports records/sec, so that you can compare it against the file size left behind after a kill -9 mid-run.

Predict the ordering before running; then explain any surprise.

Read This Only If Stuck