Journaling: Write-Ahead Logs and Commit Ordering

What This Concept Is

A journal (also called a write-ahead log, WAL) is a reserved region of the file system where the FS writes its intent before touching any home location. The discipline:

  1. Write all pending updates into the journal as a transaction: a TxBegin (TxB) block followed by the updates.
  2. Force those writes to stable storage.
  3. Write the commit block (TxEnd, TxE) at the end of the transaction; force that.
  4. Only then apply (checkpoint) the updates to their real locations.
  5. When the checkpoint is durable, the transaction's journal space can be freed for reuse.

journal region:

  +-----+-------+--------+------+-------+-----+
  | TxB | inode | bitmap | data | data2 | TxE |
  +-----+-------+--------+------+-------+-----+
  |<-- must be durable before TxE ----->|  ^
                                           +-- commit record
                                               (written only after all else is durable)
                     |
                     v
  home regions of the FS: inode table, bitmaps, data blocks

The key invariant: commit block last. A commit block is only valid if all its transaction content was already durable. On recovery, the FS replays each committed transaction and discards incomplete ones.
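
The replay rule can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how ext4 lays out its journal: the Record type, its field names, and the recover function are hypothetical, and the journal is an in-memory list rather than disk blocks. It only illustrates that a transaction is applied if and only if its commit record (TxE) is present.

```python
from dataclasses import dataclass

@dataclass
class Record:
    kind: str         # "TxB", "update", or "TxE" (the commit record)
    txid: int
    target: str = ""  # home location name, used by "update" records
    value: str = ""   # new contents, used by "update" records

def recover(journal, home):
    """Replay committed transactions into `home`, in log order."""
    pending = {}  # txid -> updates seen so far for an open transaction
    for rec in journal:
        if rec.kind == "TxB":
            pending[rec.txid] = []
        elif rec.kind == "update" and rec.txid in pending:
            pending[rec.txid].append(rec)
        elif rec.kind == "TxE" and rec.txid in pending:
            # Commit record present: the whole transaction is in the log.
            for upd in pending.pop(rec.txid):
                home[upd.target] = upd.value
    # Anything still in `pending` never committed and is discarded.
    return home

journal = [
    Record("TxB", 1),
    Record("update", 1, "inode[7]", "size=8192"),
    Record("update", 1, "bitmap", "block 42 used"),
    Record("TxE", 1),
    Record("TxB", 2),
    Record("update", 2, "inode[7]", "size=12288"),
    # crash here: transaction 2 has no TxE
]
home = recover(journal, {"inode[7]": "size=4096", "bitmap": "block 42 free"})
print(home)
```

Transaction 1 is replayed and transaction 2 is ignored, so the inode ends at size=8192 rather than size=12288.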

Why It Matters Here

Journaling trades some write throughput for correctness and bounded recovery time. After a crash, the FS does not run fsck over the whole disk; it scans only the journal (a small region of bounded size) and replays it. Recovery typically takes seconds rather than the minutes or hours a full-disk check can take.

Two common modes:

  • Metadata journaling (ext3/4 default, data=ordered): logs only metadata (inodes, bitmaps, directory blocks). Data blocks are written to their home before the metadata commit, so a crash does not leave metadata pointing at garbage. Fast; safe against most failures.
  • Data journaling (data=journal): logs data blocks in the journal too, so every write costs ~2x (journal + home). Safest; rarely used.

data=writeback is a third mode that logs metadata but permits data to be written in any order relative to it; it is cheaper, but after a crash a file may contain stale or garbage data from blocks that were allocated but never written. ext4 defaults to data=ordered.
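
To make the "~2x" figure concrete, here is a rough count of block writes per transaction in each mode. It is a simplified model, not a measurement: the function name blocks_written is hypothetical, and it assumes exactly one TxB block and one commit block per transaction and no batching of several operations into one transaction.

```python
def blocks_written(mode, data_blocks, metadata_blocks):
    """Count block writes for one transaction under a simplified model."""
    overhead = 2  # assumption: one TxB block plus one commit block
    if mode == "journal":
        logged = data_blocks + metadata_blocks  # data and metadata are logged
    elif mode in ("ordered", "writeback"):
        logged = metadata_blocks                # only metadata is logged
    else:
        raise ValueError(f"unknown mode: {mode}")
    journal_writes = logged + overhead
    home_writes = data_blocks + metadata_blocks  # everything reaches its home once
    return journal_writes + home_writes

# A 1 MiB append: 256 data blocks of 4 KiB, plus inode and bitmap updates.
for mode in ("ordered", "writeback", "journal"):
    print(mode, blocks_written(mode, data_blocks=256, metadata_blocks=2))
```

Under these assumptions the append costs 262 block writes in ordered or writeback mode and 518 in data=journal mode, close to double. The ordered and writeback counts are equal because the two modes differ in write ordering, not in write volume.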

Concrete Example

Appending 4 KiB to a file, ext4 data=ordered mode:

Step 1: write new data block D to its home location (block B).
        No metadata points at block B yet, so readers cannot see it.

Step 2: write the journal contents: TxB, inode update, bitmap update.
        Barrier/flush so block B and all journal blocks are stable.

Step 3: write the journal commit block (TxE). Barrier/flush.
        The transaction is now durable.

Step 4: checkpoint: write the updated inode to the inode table and the
        updated bitmap to the bitmap region. These are the homes.

Step 5: once step 4 is durable, the transaction's journal space can be
        freed for reuse.

Crash before step 3: no commit block exists, so recovery ignores the partial log. Block B may hold the new data, but no inode points at it and the bitmap still marks it free. The append is lost, but the file system is undamaged.

Crash after step 3 but before step 4: on recovery, the FS finds a committed transaction in the journal. It replays the metadata updates against the home regions. The file is complete.

Crash after step 4: the transaction can be (and eventually is) erased from the journal. The home regions are already up to date. Recovery has nothing to replay.

At no point does a crash leave metadata pointing at unwritten data, because the data is ordered before the commit.
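
The crash cases above can be checked mechanically with a small simulation. This is a toy model with hypothetical names (run_until_crash, recover, the disk dictionary and its keys); it treats every step as atomic and every completed step as durable, which real hardware only guarantees with the flushes described in the steps.

```python
def run_until_crash(crash_after_step):
    """Perform steps 1..crash_after_step of the append, then 'crash'."""
    disk = {
        "block_B": None,             # home of the new data block
        "inode": "size=0, ptrs=[]",  # inode at its home location
        "bitmap": "B free",          # bitmap at its home location
        "journal": [],               # logged metadata updates
        "committed": False,          # is the commit block present?
    }
    new_meta = [("inode", "size=4096, ptrs=[B]"), ("bitmap", "B used")]
    steps = [
        lambda d: d.update(block_B="DATA"),               # 1: data to its home
        lambda d: d["journal"].extend(new_meta),          # 2: log metadata
        lambda d: d.update(committed=True),               # 3: commit block
        lambda d: d.update(dict(d["journal"])),           # 4: checkpoint
        lambda d: d.update(journal=[], committed=False),  # 5: free journal space
    ]
    for step in steps[:crash_after_step]:
        step(disk)
    return disk

def recover(disk):
    """Replay the journal if it is committed; otherwise discard it."""
    if disk["committed"]:
        disk.update(dict(disk["journal"]))
    disk["journal"], disk["committed"] = [], False
    return disk

def metadata_points_at_unwritten_data(disk):
    return "ptrs=[B]" in disk["inode"] and disk["block_B"] is None

for n in range(6):  # crash after 0, 1, ..., 5 completed steps
    disk = recover(run_until_crash(n))
    assert not metadata_points_at_unwritten_data(disk)
    print(n, disk["inode"], "|", disk["bitmap"])
```

A crash after zero, one, or two completed steps recovers to the old inode and bitmap; a crash after three or more recovers to the new ones. In no case does the inode point at block B while B is unwritten.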

Common Confusion / Misconception

"The journal is just a backup of metadata." No. The journal is a forward-only stream of changes: it records new values so they can be redone after a crash. Old values need not be in it, because ext3/ext4-style journaling is redo-only and never has to undo a committed change.

"Journaling makes writes slow because everything is written twice." Only data journaling writes everything twice. Metadata journaling writes metadata twice (a small fraction) and data once.

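"ext4 survives any crash." No. Journaling is designed to keep the file system consistent across a power failure or kernel crash, provided the drive honors flush commands. It does not protect against drive firmware bugs, silent bit rot, misdirected writes, or whole-disk loss. For corruption, use a checksumming file system (btrfs, zfs); for disk loss, use RAID-style redundancy.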

How To Use It

When reasoning about a journaling FS under failure, apply two rules:

  1. Commit barrier: if the commit block is durable, all earlier entries in the transaction are durable.
  2. Data ordering (ordered mode): data is durable at its home before the metadata commit, so committed metadata never points at unwritten data.

Together these define what survives. Apply them to any operation and you can state its post-crash invariants.

Check Yourself

  1. Why is the commit block a single small write at the end, rather than a flag embedded in the first block?
  2. What is the worst that can happen under data=writeback mode?
  3. Why does write-ahead logging in a database (postgres, mysql) follow the same pattern?

Mini Drill or Application

Design the journal entry layout for an ext-style FS: what do TxB, the metadata blocks, and the commit block each contain? How does recovery identify a transaction as complete?

Then trace rename("/a/x", "/b/y") under metadata journaling: which blocks are in the transaction? Where is the commit block? What survives a crash after the commit?

Read This Only If Stuck