Atomicity and Durability via WAL and Recovery
What This Concept Is
Atomicity and durability are implemented by the same underlying mechanism: the write-ahead log (WAL). The rule is simple and unforgiving: a modification to a data page is never considered durable until its log record has been forced to stable storage. This single rule makes A and D tractable in the presence of crashes.
Key pieces:
- Log records: every change (insert, update, delete) writes an entry describing old value (for undo) and new value (for redo), plus a log sequence number (LSN) that totally orders log records.
- WAL rule: flush the log record before the corresponding data page reaches disk.
- Commit marker: a COMMIT record in the log. When this record is on stable storage, the transaction is durable; until then, it can be rolled back.
- Checkpoints: periodic records that summarize "everything up to LSN X is on disk in the data files," bounding recovery work.
- Recovery: on startup after a crash, scan the log from the last checkpoint. Redo committed work. Undo in-progress work. The canonical algorithm is ARIES: analysis, redo, undo.
Why It Matters Here
If you do not understand WAL, statements like "group commit," "synchronous replication," and "fsync matters" sound interchangeable when they are not. You also cannot debug why a COMMIT sometimes takes 20ms (fsync latency) or why a database can lose work on power loss despite returning success (fsync lied because of volatile caches). Every later durability claim, including synchronous replication, is WAL-shaped.
Concrete Example
Schedule of two transactions T1 (credits account A) and T2 (credits account B), both active when the system crashes after T1 commits but before T2 commits.
LSN 10 T1 BEGIN
LSN 11 T1 UPDATE A: old=500 new=600
LSN 12 T2 BEGIN
LSN 13 T2 UPDATE B: old=200 new=300
LSN 14 T1 COMMIT <- log forced to disk
-- CRASH before data pages written --
Recovery:
- Analysis: reads from last checkpoint. Determines T1 committed (has COMMIT record), T2 did not.
- Redo: replays LSN 11 and LSN 13 against the data pages (redo is applied for all transactions, even losers, to reach a known state).
- Undo: rolls back T2's effect at LSN 13 by writing old=200 back. Each undo action itself logs a compensation log record (CLR) so re-crashes during recovery still recover.
Final durable state: A = 600 (T1 committed), B = 200 (T2 rolled back). Atomicity preserved for T2, durability preserved for T1.
Common Confusion / Misconception
"Logging is about audit." No. Logging here is the primary persistence mechanism. The log is the source of truth at commit time; data files catch up lazily.
"fsync makes writes durable." fsync asks the OS to force buffers to the disk; the disk can still have a volatile cache. Databases like PostgreSQL expose synchronous_commit and require battery-backed or power-loss-protected disks for true durability.
"Redo is for committed, undo is for uncommitted." ARIES redoes all logged changes (committed or not) to reach the state at crash time, then undoes only uncommitted transactions. The nuance matters because it lets the recovery algorithm use the same replay machinery for both.
How To Use It
When reasoning about a DB's durability story, ask these in order:
- Does it write a WAL record before the data page?
- Is
fsyncissued on COMMIT, and is the disk honest about it? - On crash, can the recovery algorithm rebuild all committed state from WAL alone?
- Is the recovery bounded by checkpoints or must it scan the entire log?
If any answer is "no" or "depends," that is where your durability story has a seam.
Check Yourself
- Why must a log record be flushed before the corresponding data page, and not the other way around?
- What is the role of checkpoints in bounding recovery time?
- During recovery, why is redo applied even to transactions that will be rolled back in the undo phase?
- What is a compensation log record (CLR) and why does it matter for crashes during recovery?
- If your disk has a volatile write cache that lies about fsync, which ACID property is at risk and how?
Mini Drill or Application
Given this log (LSN: action):
100 T1 BEGIN
101 T1 UPDATE x: 5 -> 7
102 CHECKPOINT (dirty: T1; data file x=5)
103 T2 BEGIN
104 T2 UPDATE y: 2 -> 4
105 T1 COMMIT
106 T3 BEGIN
107 T3 UPDATE z: 8 -> 9
-- CRASH --
- Which log records are redone during recovery?
- Which transactions are undone, and what do their undo records write back?
- What is the final state of x, y, z after recovery?
- Why is the checkpoint at LSN 102 useful even though T1 had not yet committed?