The Crash Consistency Problem
What This Concept Is
A single file system operation (append, create, truncate) typically updates multiple on-disk structures that live in different locations. The disk can only commit one write at a time. If power is lost between writes, the file system is in an intermediate state that may violate its own invariants.
The canonical example: append a new 4 KiB block to a file on an ext-style FS. Three updates are needed:
- Data bitmap: mark the new block as allocated.
- Inode: update size; add pointer to the new data block.
- Data block: write the new user data.
Six possible orderings, four of which leave the FS inconsistent after a crash:
Case Bitmap Inode Data Post-crash state
+-----------------------------+-------+-------+-------+---------------------------+
| B wrote, others no | Y | N | N | Block leaked (used, never |
| | | | | pointed to) |
+-----------------------------+-------+-------+-------+---------------------------+
| I wrote, others no | N | Y | N | Dangling pointer to free |
| | | | | block; garbage data |
+-----------------------------+-------+-------+-------+---------------------------+
| D wrote, others no | N | N | Y | No-op (good!) |
+-----------------------------+-------+-------+-------+---------------------------+
| B+I, no D | Y | Y | N | File grew, points to old |
| | | | | garbage as its new block |
+-----------------------------+-------+-------+-------+---------------------------+
| B+D, no I | Y | N | Y | Leaked block, orphan data |
+-----------------------------+-------+-------+-------+---------------------------+
| I+D, no B | N | Y | Y | Valid-looking file whose |
| | | | | block is flagged free! |
+-----------------------------+-------+-------+-------+---------------------------+
Most of these are subtly bad. "Valid-looking file with a free block" is especially insidious: the allocator may later hand the same block to another file, silently corrupting both.
Why It Matters Here
Every later concept in this cluster is an answer to this problem. Journaling (concept 08) chooses a deliberate order and logs intent before touching homes. COW (concept 09) avoids overwriting anything in place. fsck is a brute-force recovery pass that was the only answer before journals existed.
The problem is not specific to file systems. Any system that updates multiple persistent structures has a version of it: databases (WAL), distributed systems (two-phase commit), virtual memory (swap files). Learning the FS version teaches a transferable skill.
Concrete Example
Walkthrough: a process appends 4 KiB to an existing file, block size 4 KiB, ext-style. Initial state: inode I with size 0, data bitmap has bit B clear, data block B holds garbage.
Step 1: write bitmap (bit B = 1)
Step 2: write inode (size = 4096, blocks[0] = B)
Step 3: write data (block B = "new data")
Power fails after step 2. On remount:
- The data bitmap says
Bis allocated. - The inode says the file is 4096 bytes, pointed at
B. - Block
Bcontains garbage.
A read returns garbage labeled as user data. Worse: if the user trusted the write, downstream logic will corrupt further.
Reverse the order:
Step 1: write data (B = "new data")
Step 2: write bitmap (bit B = 1)
Step 3: write inode (size = 4096, pointer = B)
Power fails after step 2: data block has new content, bitmap says used, inode has not changed. The user data is orphaned (no inode points to it) but no one is misled.
Neither order is safe enough on its own. The modern answer: log the intent first, then apply, then mark committed.
Common Confusion / Misconception
"The kernel writes these in order, so I'm safe." The kernel schedules writes; the disk may reorder them. Drive caches add another reorder. Without barriers or FUA (Force Unit Access) flags, "I issued A then B" becomes "B is on platter, A is in cache."
"This is only a problem for old HDDs." The disk cache problem is worse on modern SSDs because their internal FTL can reorder aggressively. Journaling and barriers are essential regardless of hardware.
"fsync prevents the problem." fsync forces data and metadata for one file out of the cache. It does not make a multi-step operation atomic. Between two fsyncs, a crash is possible.
How To Use It
For any FS operation you design or analyze, list the structures it modifies, enumerate every possible crash ordering, and check whether each leaves the FS recoverable. If any ordering is unrecoverable, you need a recovery mechanism: journaling, COW, or a consistency checker that can reconstruct state.
Check Yourself
- Why is "data last" (bitmap, inode, then data) worse than "data first" (data, bitmap, inode)?
- What invariant does a consistency checker rely on to detect a leaked block?
- Why does simply flushing writes in order not suffice without drive-level cache control?
Mini Drill or Application
For each operation, list the on-disk updates and circle the worst possible crash ordering:
rename("a", "b")within one directory.- Overwriting the first byte of a 1 GiB file in place.
- Extending a file by writing past its current end by 100 KiB.
unlink("a")on a file with two hard links.- Creating a new file in a full directory (forcing a new directory data block).