Write-Ahead Logging (WAL) and Recovery with ARIES Sketch

What This Concept Is

A database must survive a crash. The write-ahead logging (WAL) rule is the hinge of every durable engine:

Before a modified page is written back to disk, the log record describing the modification must already be on stable storage.

With that one rule, recovery becomes possible. If the engine crashes after the log record is durable but before the data page is written, we can redo from the log. If it crashes with the modification only in memory, we have the log record to know whether to redo or discard.

WAL records typically store:

the transaction id
the page id and offset
the before-image and after-image (or enough to reconstruct them)
a log sequence number (LSN), monotonically increasing
the previous LSN for the same transaction (a back-pointer chain for undo)

Commit is a log flush. COMMIT returns to the user only after the commit record is fsync-ed to the log. That single fsync is the durability boundary. Everything else (data pages, index updates) can be lazy.

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) is the canonical recovery algorithm, by Mohan et al. (1992). Simplified sketch in three phases after a crash:

Analysis: scan the log forward from the last checkpoint to rebuild the transaction table (who was live at crash) and the dirty page table (which pages had unlogged changes).
Redo: scan forward from the earliest recLSN in the dirty page table, reapplying every log record whose effect is not already reflected in the page on disk (identified by comparing record LSN to page LSN). This includes records of transactions that later aborted -- "repeating history."
Undo: for each transaction that was live at crash, walk backward along its log chain and reverse its actions using compensation log records (CLRs), which are themselves logged so a second crash mid-undo does not lose progress.

Checkpoints bound the amount of work recovery must do. A checkpoint writes a record noting the active transactions and flushes selected dirty pages. Fuzzy checkpointing (the kind ARIES uses) does not stop the system -- it simply marks a known-good starting point for Analysis.

Why It Matters Here

Every modern engine is an ARIES descendant: PostgreSQL, InnoDB, SQL Server, DB2, Oracle. The LSM engines (RocksDB) have their own WAL on top of SSTables; the rule is the same.

This is the concept that ties Module 2 together. Everything you have learned -- slotted pages, buffer pool, B+-tree updates, MVCC version chains -- all of it is logged to WAL first. When you hear "durability," it means the commit log is on disk.

Concrete Example -- A Commit and a Crash

T1: BEGIN                          (LSN=100) log: BEGIN T1
T1: UPDATE A.x = 1                 (LSN=101) log: T1, page=A, before=0, after=1
T1: UPDATE B.y = 2                 (LSN=102) log: T1, page=B, before=_, after=2
T1: COMMIT                         (LSN=103) log: COMMIT T1  -> fsync log
                                                 <-- ack to client

# pages A and B still dirty in buffer pool, not yet written

CRASH

On restart:

Analysis: from last checkpoint, see T1 started and committed by LSN 103.
Redo: for each logged update, compare page LSN on disk to record LSN; if disk LSN < record LSN, reapply. Here both A and B are stale, so redo both.
Undo: no live losers to undo (T1 committed).

Durability preserved. If T1 had not yet committed when the crash happened, step 3 would roll back its modifications using the before-images and emit CLRs.

Concrete Example -- Crash Mid-Undo

If the engine crashes again while it was undoing T1:

on restart, Analysis rediscovers T1 as a loser
Redo "repeats history": applies the original updates and the CLRs already written
Undo continues from where CLRs left off, using the undoNext pointer each CLR stores

The system converges in a bounded number of restarts. This is why CLRs are logged: they are how ARIES makes undo idempotent.

Common Confusion / Misconception

"The data page is authoritative." The log is authoritative. Until recovery runs, the page file can be in an arbitrary mix of old and new. Recovery reconciles it against the log.

"Checkpoint means full flush." Fuzzy checkpoints do not stop the system or force all dirty pages out. They mark a point from which Analysis can safely begin. Full flushing is too expensive to do often.

"WAL doubles write volume." WAL writes are sequential -- far cheaper per byte than the random data-page writes they replace. Real write amplification is closer to 1.5x-2x, not 2x, and the sequentiality pays off many times over.

How To Use It

When thinking about durability and recovery:

Confirm fsync-on-commit is actually enabled (look for synchronous_commit = on, innodb_flush_log_at_trx_commit = 1, etc.).
Know your checkpoint interval -- it bounds recovery time.
Know whether your engine uses physical logging (change a specific byte), logical logging (a SQL statement), or physiological (change within a page, identified by offset). ARIES is physiological.
For replication (Module 3), WAL is almost always what is shipped to followers.
If you are asked "how much data can I lose on a crash?" the honest answer is "at most the writes that had not yet reached the log when the OS lost power." Configure storage accordingly.

Check Yourself

Why does the WAL rule require the log to reach disk before the modified page, not simultaneously?
Why does ARIES "repeat history" in redo, including the work of transactions that later aborted?
What job do compensation log records do, and why are they themselves logged?

Mini Drill or Application

Given this log tail (oldest first) and a crash right after the last record:

LSN=200  CHECKPOINT  active={T1}  dirty={A:recLSN=198}
LSN=201  T1  UPDATE A before=0 after=1
LSN=202  T2  BEGIN
LSN=203  T2  UPDATE B before=0 after=5
LSN=204  T1  COMMIT
LSN=205  T2  UPDATE C before=0 after=9
CRASH

Which transactions are winners? Which are losers?
Which pages need redo? Starting from which LSN?
What does undo do, in order, for T2?
If the engine crashes again during step 3, describe how it resumes.

What This Concept Is​

Why It Matters Here​

Concrete Example -- A Commit and a Crash​

Concrete Example -- Crash Mid-Undo​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​