Write-Ahead Logging (WAL) and Recovery with ARIES Sketch
What This Concept Is
A database must survive a crash. The write-ahead logging (WAL) rule is the hinge of every durable engine:
Before a modified page is written back to disk, the log record describing the modification must already be on stable storage.
With that one rule, recovery becomes possible. If the engine crashes after the log record is durable but before the data page is written, we can redo from the log. If it crashes with the modification only in memory, we have the log record to know whether to redo or discard.
WAL records typically store:
- the transaction id
- the page id and offset
- the before-image and after-image (or enough to reconstruct them)
- a log sequence number (LSN), monotonically increasing
- the previous LSN for the same transaction (a back-pointer chain for undo)
Commit is a log flush. COMMIT returns to the user only after the commit record is fsync-ed to the log. That single fsync is the durability boundary. Everything else (data pages, index updates) can be lazy.
ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is the canonical recovery algorithm, by Mohan et al. (1992). Simplified sketch in three phases after a crash:
- Analysis: scan the log forward from the last checkpoint to rebuild the transaction table (who was live at crash) and the dirty page table (which pages had unlogged changes).
- Redo: scan forward from the earliest
recLSNin the dirty page table, reapplying every log record whose effect is not already reflected in the page on disk (identified by comparing record LSN to page LSN). This includes records of transactions that later aborted -- "repeating history." - Undo: for each transaction that was live at crash, walk backward along its log chain and reverse its actions using compensation log records (CLRs), which are themselves logged so a second crash mid-undo does not lose progress.
Checkpoints bound the amount of work recovery must do. A checkpoint writes a record noting the active transactions and flushes selected dirty pages. Fuzzy checkpointing (the kind ARIES uses) does not stop the system -- it simply marks a known-good starting point for Analysis.
Why It Matters Here
Every modern engine is an ARIES descendant: PostgreSQL, InnoDB, SQL Server, DB2, Oracle. The LSM engines (RocksDB) have their own WAL on top of SSTables; the rule is the same.
This is the concept that ties Module 2 together. Everything you have learned -- slotted pages, buffer pool, B+-tree updates, MVCC version chains -- all of it is logged to WAL first. When you hear "durability," it means the commit log is on disk.
Concrete Example -- A Commit and a Crash
T1: BEGIN (LSN=100) log: BEGIN T1
T1: UPDATE A.x = 1 (LSN=101) log: T1, page=A, before=0, after=1
T1: UPDATE B.y = 2 (LSN=102) log: T1, page=B, before=_, after=2
T1: COMMIT (LSN=103) log: COMMIT T1 -> fsync log
<-- ack to client
# pages A and B still dirty in buffer pool, not yet written
CRASH
On restart:
- Analysis: from last checkpoint, see T1 started and committed by LSN 103.
- Redo: for each logged update, compare page LSN on disk to record LSN; if disk LSN < record LSN, reapply. Here both A and B are stale, so redo both.
- Undo: no live losers to undo (T1 committed).
Durability preserved. If T1 had not yet committed when the crash happened, step 3 would roll back its modifications using the before-images and emit CLRs.
Concrete Example -- Crash Mid-Undo
If the engine crashes again while it was undoing T1:
- on restart, Analysis rediscovers
T1as a loser - Redo "repeats history": applies the original updates and the CLRs already written
- Undo continues from where CLRs left off, using the
undoNextpointer each CLR stores
The system converges in a bounded number of restarts. This is why CLRs are logged: they are how ARIES makes undo idempotent.
Common Confusion / Misconception
"The data page is authoritative." The log is authoritative. Until recovery runs, the page file can be in an arbitrary mix of old and new. Recovery reconciles it against the log.
"Checkpoint means full flush." Fuzzy checkpoints do not stop the system or force all dirty pages out. They mark a point from which Analysis can safely begin. Full flushing is too expensive to do often.
"WAL doubles write volume." WAL writes are sequential -- far cheaper per byte than the random data-page writes they replace. Real write amplification is closer to 1.5x-2x, not 2x, and the sequentiality pays off many times over.
How To Use It
When thinking about durability and recovery:
- Confirm
fsync-on-commit is actually enabled (look forsynchronous_commit = on,innodb_flush_log_at_trx_commit = 1, etc.). - Know your checkpoint interval -- it bounds recovery time.
- Know whether your engine uses physical logging (change a specific byte), logical logging (a SQL statement), or physiological (change within a page, identified by offset). ARIES is physiological.
- For replication (Module 3), WAL is almost always what is shipped to followers.
- If you are asked "how much data can I lose on a crash?" the honest answer is "at most the writes that had not yet reached the log when the OS lost power." Configure storage accordingly.
Check Yourself
- Why does the WAL rule require the log to reach disk before the modified page, not simultaneously?
- Why does ARIES "repeat history" in redo, including the work of transactions that later aborted?
- What job do compensation log records do, and why are they themselves logged?
Mini Drill or Application
Given this log tail (oldest first) and a crash right after the last record:
LSN=200 CHECKPOINT active={T1} dirty={A:recLSN=198}
LSN=201 T1 UPDATE A before=0 after=1
LSN=202 T2 BEGIN
LSN=203 T2 UPDATE B before=0 after=5
LSN=204 T1 COMMIT
LSN=205 T2 UPDATE C before=0 after=9
CRASH
- Which transactions are winners? Which are losers?
- Which pages need redo? Starting from which LSN?
- What does undo do, in order, for
T2? - If the engine crashes again during step 3, describe how it resumes.
Read This Only If Stuck
- Database Internals: Chapter 5 Transaction Processing and Recovery
- Database Internals: Recovery
- Database Internals: Log Semantics
- Database Internals: ARIES
- Database System Concepts: 19.3 Recovery and Atomicity (Part 1)
- Database System Concepts: 19.4 Recovery Algorithm (Part 1)
- Database System Concepts: 19.4 Recovery Algorithm (Part 2)