
Module 4: Transactions & Consistency

Primary text: Designing Data-Intensive Applications (Kleppmann), Chapters 7 and 9

Selective support: Database System Concepts (Silberschatz) for classical concurrency control; Database Internals (Petrov), Part II, for implementation-level detail; Distributed Systems: Concepts and Design (Coulouris) for the canonical consistency treatment

This module is where single-node transaction guarantees meet distributed reality. You already know replication and partitioning from Module 3. Here you learn what transactions actually guarantee, how isolation is implemented, and which consistency model a distributed system can honestly offer.


Scope of This Module

This module is not "everything about databases." It is where correctness reasoning under concurrency becomes something you can defend.

What it covers in depth:

  • ACID as four distinct guarantees and which ones require which machinery
  • atomicity and durability via write-ahead logging (WAL) and ARIES-style recovery
  • BASE as a deliberately weaker vocabulary and when it fits
  • concurrency anomalies: dirty read, dirty write, lost update, read skew, write skew, phantom
  • the ANSI SQL isolation levels, what each actually prevents, and where the spec is underspecified
  • two-phase locking (2PL) and strict 2PL
  • Snapshot Isolation, MVCC, and why SI is not serializable
  • Serializable Snapshot Isolation (SSI) and how PostgreSQL implements it
  • two-phase commit (2PC), coordinator failure modes, and heuristic decisions
  • three-phase commit and Paxos Commit as repairs of 2PC's blocking problem
  • sagas and compensating actions for long-running distributed workflows
  • linearizability and the single-copy illusion
  • causal consistency, eventual consistency, and session guarantees
  • the CAP theorem, PACELC, and how to read them without oversimplifying

What it deliberately does not try to finish here:

  • consensus protocols in depth (Module 5)
  • full distributed-systems fault models and failure detectors (Module 5)
  • stream processing and event sourcing (later semester)
  • blockchain-specific consistency (out of scope)

Before You Start

Answer these closed-book before starting the main path:

  1. Given a crash mid-transaction, what guarantees the database recovers to a consistent state?
  2. Two clients run UPDATE counter SET n = n + 1. Under what isolation levels can the final value be less than it should be?
  3. What is the difference between repeatable read and snapshot isolation?
  4. Why can a client read its own write but later see an older value somewhere else in a replicated system?
  5. What does CAP actually say, and what does it not say?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path and can spend less time on Cluster 1.

2-3 solid answers

  • Continue, but expect extra time in Cluster 2 (anomalies) and Cluster 5 (consistency models).

0-1 solid answers

  • Revisit Module 3 (replication) briefly. The anomalies and consistency models in this module only make sense on top of a concrete mental model of replicated state.

What This Module Is For

Transactions and consistency are where engineering bugs become correctness bugs. Throughout the program you will repeatedly be asked:

  • does this code correctly update the balance when two users transfer at once?
  • which isolation level do I set, and what does that choice cost me?
  • is the write I just committed visible to the next read from the same session? from a different replica?
  • can this workflow be a single transaction, or must it be a saga with compensations?
  • when my vendor says "strongly consistent," what do they actually mean?

This module builds the reasoning needed for:

  • consensus and replication protocols (Module 5)
  • service and storage architecture in later semesters
  • every production system where money, identity, or inventory is involved

You are learning to stop waving your hands about "the database will handle it."


Concept Map


How To Use This Module

Work in order. Later clusters only make sense if the earlier vocabulary is stable.

Cluster 1: ACID and the Single-Node Transaction

Order | Concept | Type | Focus
1 | ACID Properties: What Each Actually Guarantees | PRIMARY | Pulling A, C, I, D apart and naming which property needs which machinery
2 | Atomicity and Durability via WAL and Recovery | PRIMARY | Write-ahead log, redo/undo, checkpointing, ARIES-style recovery
3 | BASE: The Alternative Vocabulary and Where It Fits | SUPPORTING | Basically Available, Soft state, Eventual consistency as a contrast

Cluster mastery check: For a given database feature (journaling, constraint checking, isolation level, fsync), can you name which ACID letter it actually supports?
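
To make the redo/undo vocabulary of Concept 2 concrete, here is a minimal sketch of recovery from a write-ahead log. It is a teaching toy under stated assumptions, not ARIES itself: the record layout, the `recover` function, and the example log are all hypothetical, and it omits LSNs, checkpoints, the analysis pass, and compensation log records. It shows only why redo comes first: redo repeats history so the state matches the log, and undo then rolls back the transactions that never committed.

```python
def recover(log, disk):
    """Return the state after recovery: redo every logged update, then undo losers.

    `log` holds ("update", txn, key, old, new) and ("commit", txn) records
    (a hypothetical layout chosen for this sketch).
    """
    state = dict(disk)
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo pass: repeat history in log order, whether or not the txn committed.
    for rec in log:
        if rec[0] == "update":
            _, _txn, key, _old, new = rec
            state[key] = new

    # Undo pass: walk backwards and restore old values written by uncommitted txns.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _txn, key, old, _new = rec
            state[key] = old
    return state


log = [
    ("update", "T1", "A", 100, 50),
    ("update", "T2", "B", 10, 20),
    ("update", "T1", "C", 0, 50),
    ("commit", "T1"),
    ("update", "T2", "A", 50, 70),
    # crash here: T2 never committed
]

# Only some writes had reached disk when the crash happened.
disk_at_crash = {"A": 100, "B": 20, "C": 0}
recovered = recover(log, disk_at_crash)
print(recovered)  # {'A': 50, 'B': 10, 'C': 50}
```

T1's effects survive (durability) and T2's effects vanish (atomicity), regardless of which writes happened to reach disk before the crash.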

Cluster 2: Concurrency Anomalies

Order | Concept | Type | Focus
4 | Dirty Reads, Dirty Writes, Lost Updates | PRIMARY | The three "obvious" anomalies with interleaved timelines
5 | Read Skew, Write Skew, Phantom Reads | PRIMARY | The anomalies most people get wrong about SI and RR
6 | Isolation Levels: RU, RC, RR, Serializable | PRIMARY | ANSI SQL levels, what each prevents, where the spec is vague

Cluster mastery check: Given a schedule of two transactions, can you name the anomaly it exhibits and the weakest isolation level that prevents it?
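
As a warm-up for reading schedules, the following sketch replays two hand-written interleavings of a read-modify-write increment against a single counter. The `run_schedule` helper and the schedule encoding are hypothetical, chosen only for illustration; they model an application that reads a value and writes it back in separate steps, not a single atomic UPDATE statement.

```python
def run_schedule(schedule, initial=0):
    """Replay (txn, op) steps against one shared counter and return its final value."""
    db = {"n": initial}
    local = {}  # value each transaction read earlier and still holds privately
    for txn, op in schedule:
        if op == "read":
            local[txn] = db["n"]
        elif op == "write":
            # The write is based on the possibly stale value this txn read.
            db["n"] = local[txn] + 1
        else:
            raise ValueError(f"unknown op: {op}")
    return db["n"]


serial = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]
interleaved = [("T1", "read"), ("T2", "read"), ("T1", "write"), ("T2", "write")]

serial_result = run_schedule(serial)
lost_update_result = run_schedule(interleaved)
print(serial_result, lost_update_result)  # 2 1
```

In the interleaved schedule both transactions read 0, so the second write overwrites the first and one increment is lost.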

Cluster 3: Implementing Isolation

Order | Concept | Type | Focus
7 | Two-Phase Locking (2PL) and Serialized Schedules | PRIMARY | Growing and shrinking phases, strict 2PL, deadlock, predicate locks
8 | Snapshot Isolation and MVCC | PRIMARY | Versioned reads, first-committer-wins, why SI is not serializable
9 | Serializable Snapshot Isolation (SSI) | SUPPORTING | Detecting dangerous rw-antidependencies on top of SI

Cluster mastery check: For a given workload, can you pick between 2PL, SI, and SSI and defend the choice by anomaly tolerance and contention profile?
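
The claim that SI stops lost updates but not write skew can be checked on a toy model. The sketch below is a simplified in-memory multi-version store with first-committer-wins; the class names (`SnapshotStore`, `Txn`, `WriteConflict`) and the on-call example are assumptions made for illustration and do not mirror any real engine's internals.

```python
class WriteConflict(Exception):
    """Raised when another transaction committed a write to the same key first."""


class Txn:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts  # snapshot timestamp
        self.writes = {}          # buffered writes, private until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Newest version committed at or before this transaction's snapshot.
        visible = [v for ts, v in self.store.versions[key] if ts <= self.start_ts]
        return visible[-1]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # First-committer-wins: abort if any written key changed after our snapshot.
        for key in self.writes:
            if self.store.versions[key][-1][0] > self.start_ts:
                raise WriteConflict(key)
        self.store.commit_ts += 1
        for key, value in self.writes.items():
            self.store.versions[key].append((self.store.commit_ts, value))


class SnapshotStore:
    """Each key maps to a list of (commit_ts, value), oldest first."""

    def __init__(self, initial):
        self.commit_ts = 0
        self.versions = {k: [(0, v)] for k, v in initial.items()}

    def begin(self):
        return Txn(self, self.commit_ts)


def go_off_call(txn, doctor):
    # Invariant the application wants: at least one doctor stays on call.
    if sum(txn.read(d) for d in ("alice", "bob")) >= 2:
        txn.write(doctor, False)


# Write skew: disjoint write sets, so both commits succeed and the invariant breaks.
store = SnapshotStore({"alice": True, "bob": True})
t1, t2 = store.begin(), store.begin()
go_off_call(t1, "alice")
go_off_call(t2, "bob")
t1.commit()
t2.commit()
final = store.begin()
still_on_call = sum(final.read(d) for d in ("alice", "bob"))

# Lost update: same key, so the second committer is rejected.
counter = SnapshotStore({"n": 0})
a, b = counter.begin(), counter.begin()
a.write("n", a.read("n") + 1)
b.write("n", b.read("n") + 1)
a.commit()
try:
    b.commit()
    second_committed = True
except WriteConflict:
    second_committed = False

print(still_on_call, second_committed)  # 0 False
```

The conflict check only inspects keys a transaction wrote, which is exactly why write skew slips through: each transaction's decision depended on a row it read but did not write.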

Cluster 4: Distributed Transactions

Order | Concept | Type | Focus
10 | Two-Phase Commit (2PC): Coordinator, Participants, Failure Modes | PRIMARY | Prepare/commit message flow, in-doubt participants, heuristic decisions
11 | Three-Phase Commit and Paxos Commit | SUPPORTING | Non-blocking commit under restricted failure models
12 | Sagas: Long-Running Transactions with Compensations | PRIMARY | Orchestration vs choreography, compensating actions, semantic rollback

Cluster mastery check: For a cross-service workflow, can you decide between 2PC and a saga, and can you list the compensating action for every forward step?
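
One way to practice listing a compensating action for every forward step is to write the saga as data. The sketch below is a minimal orchestration-style runner under simplifying assumptions: everything runs in one process, compensations never fail, and the step names and the `run_saga` helper are hypothetical. A production saga would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On the first failure, run the compensations of completed steps in reverse.
    Returns (succeeded, log).
    """
    done = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed {name}")
            for prev_name, prev_compensate in reversed(done):
                prev_compensate()
                log.append(f"compensated {prev_name}")
            return False, log
        log.append(f"did {name}")
        done.append((name, compensate))
    return True, log


bookings = set()


def book(item):
    return lambda: bookings.add(item)


def cancel(item):
    # set.discard is idempotent, so a retried compensation is harmless.
    return lambda: bookings.discard(item)


def fail_payment():
    raise RuntimeError("card declined")


steps = [
    ("flight", book("flight"), cancel("flight")),
    ("hotel", book("hotel"), cancel("hotel")),
    ("payment", fail_payment, lambda: None),
]

ok, log = run_saga(steps)
print(ok, log, bookings)
```

The compensations here are semantic undo operations (cancel a booking), not a database rollback: other parties may have observed the intermediate state before it was compensated.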

Cluster 5: Consistency Models

Order | Concept | Type | Focus
13 | Linearizability and the Single-Copy Illusion | PRIMARY | Real-time order, the register model, cost and implementation
14 | Causal Consistency, Eventual Consistency, Session Guarantees | PRIMARY | Happens-before ordering, read-your-writes, monotonic reads
15 | CAP and PACELC Frameworks and When to Use Them | PRIMARY | What CAP actually says, and PACELC's extension to latency

Cluster mastery check: Given a vendor's consistency claim ("strong," "causal," "eventual with bounded staleness"), can you translate it into a client-visible behavior contract?
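
For the "real-time order" part of Concept 13, a brute-force check on a tiny history can make the definition tangible. The sketch below assumes a single read/write register, every operation completing, and a history small enough to try every ordering; the `Op` type and `is_linearizable` function are hypothetical names for this illustration, and the factorial search is unsuitable for real histories.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value the read returned
    start: float  # invocation time
    end: float    # response time


def is_linearizable(history, initial=0):
    """True if some total order respects real time and register semantics."""
    for order in permutations(history):
        # Real-time rule: an op that finished before another started must come first.
        respects_time = all(
            not (later.end < earlier.start)
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if not respects_time:
            continue
        # Register rule: every read returns the most recent write in this order.
        value = initial
        legal = True
        for op in order:
            if op.kind == "write":
                value = op.value
            elif op.value != value:
                legal = False
                break
        if legal:
            return True
    return False


# A write of 1 overlaps two reads that do not overlap each other.
fresh_history = [Op("write", 1, 0, 10), Op("read", 0, 1, 3), Op("read", 1, 4, 6)]
stale_history = [Op("write", 1, 0, 10), Op("read", 1, 1, 3), Op("read", 0, 4, 6)]

print(is_linearizable(fresh_history))  # True
print(is_linearizable(stale_history))  # False
```

In the second history the earlier read already observed the new value, so a later read returning the old value cannot be explained by any single point in time at which the write took effect.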

Then work these practice pages:

Order | Practice path | Focus
1 | Isolation and Anomalies Lab | Reproduce lost-update and write-skew anomalies on PostgreSQL under different isolation settings
2 | Saga Design Workshop | Design a saga for a real workflow with explicit compensations
3 | Consistency Model Translation Drill | Translate vendor claims into client-visible behavior; analyze a Jepsen-style history
4 | Transactions Code Katas | Timed drills on 2PC, 2PL, SI, and linearizability tests

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. State what each of A, C, I, and D actually guarantees, and name the concrete mechanism (WAL, constraints, isolation level, fsync) that supports each.
  2. Walk through an ARIES-style recovery on a small example log and explain why redo precedes undo.
  3. Given a schedule of two transactions, identify the anomaly (dirty read, dirty write, lost update, read skew, write skew, phantom) and state the weakest isolation level that prevents it.
  4. Describe how 2PL, Snapshot Isolation, and SSI each implement a chosen isolation level and where each breaks down.
  5. Explain why Snapshot Isolation prevents phantom reads in read-only queries but still allows write skew (including write skew caused by phantoms), and reproduce a write-skew violation on PostgreSQL.
  6. Draw the 2PC message flow, name every failure mode (coordinator dies, participant dies before prepare, after prepare), and describe the recovery.
  7. Design a saga for a concrete workflow (e.g., travel booking) with forward actions and semantically correct compensations.
  8. Define linearizability in terms of real-time order and apply it to a short history to decide whether the history is linearizable.
  9. Distinguish causal consistency, eventual consistency, and the four standard session guarantees, and pick which guarantees an application actually needs.
  10. State CAP and PACELC precisely and apply them to categorize a given storage system's tradeoffs without overclaiming.
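
The in-doubt case named in objective 6 can be sketched as a small state machine. This is a single-process illustration under strong assumptions: no network, no timeouts, no durable logs, and the `Participant` class and `two_phase_commit` function are hypothetical names. It shows only the shape of the protocol and why a prepared participant is stuck when the coordinator fails before delivering its decision.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name, vote):
        self.name = name
        self.vote = vote
        self.state = "working"

    def prepare(self):
        # Voting YES is a promise: the participant can no longer abort on its own.
        self.state = "prepared" if self.vote is Vote.YES else "aborted"
        return self.vote

    def finish(self, decision):
        # Only prepared participants are waiting for the outcome.
        if self.state == "prepared":
            self.state = decision


def two_phase_commit(participants, coordinator_crashes_after_prepare=False):
    # Phase 1: collect votes.
    votes = [p.prepare() for p in participants]
    decision = "committed" if all(v is Vote.YES for v in votes) else "aborted"
    if coordinator_crashes_after_prepare:
        return None  # the decision is never delivered
    # Phase 2: deliver the decision.
    for p in participants:
        p.finish(decision)
    return decision


happy = [Participant("db1", Vote.YES), Participant("db2", Vote.YES)]
happy_decision = two_phase_commit(happy)

vetoed = [Participant("db1", Vote.YES), Participant("db2", Vote.NO)]
vetoed_decision = two_phase_commit(vetoed)

stuck = [Participant("db1", Vote.YES), Participant("db2", Vote.YES)]
stuck_decision = two_phase_commit(stuck, coordinator_crashes_after_prepare=True)

print(happy_decision, [p.state for p in happy])    # committed ['committed', 'committed']
print(vetoed_decision, [p.state for p in vetoed])  # aborted ['aborted', 'aborted']
print(stuck_decision, [p.state for p in stuck])    # None ['prepared', 'prepared']
```

In the last run both participants remain prepared: they have given up the right to abort unilaterally and must hold their locks until the coordinator recovers or an operator makes a heuristic decision.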

Outputs

  • a concurrency anomaly catalog: for each of the six core anomalies, a two-row interleaved timeline, the isolation level that permits it, and the one that prevents it
  • a PostgreSQL reproduction log: you reproduced lost update at READ COMMITTED and write skew at REPEATABLE READ, and then eliminated each with the right setting
  • a 2PC state-machine diagram covering coordinator and participant, labeled with every crash-recovery transition
  • a saga design for at least one real workflow, with a table mapping each step to its compensating action and the idempotency requirement
  • a consistency model crib sheet covering linearizability, causal, session guarantees, and eventual, with one example application per row
  • a Jepsen-style analysis of a short history: at least one declared violation with a written argument
  • a short memo "when CAP is the wrong frame" that uses PACELC on the same example
  • a mistake log (at least 8 entries) on misread isolation levels and conflated consistency guarantees

Completion Standard

You have completed Module 4 when all of these are true:

  • you can name the anomaly given a two-transaction schedule and propose the weakest isolation level that prevents it
  • you have reproduced lost update and write skew on a real database and then eliminated each
  • you can defend a choice between 2PC and a saga for a specific workflow
  • you can write a plausible 2PC recovery path, including the "in-doubt" participant case
  • you can argue whether a given history is linearizable by appeal to real-time order
  • you have translated at least two vendor consistency claims into client-visible contracts
  • you no longer conflate "strong consistency" with "serializability" and no longer quote CAP as "pick two"

If you are still saying "it's ACID, so it's fine" without knowing which isolation level is in effect, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • "Read only if stuck" means: try the concept page, the self-check, and the drill first.
  • External Jepsen and Kleppmann blog posts are validated and targeted; read them when the concept page points to them.
  • Because this module underpins Module 5 (distributed systems) and every architecture module after, hand-written anomaly timelines and written 2PC/saga designs are required, not optional.

Suggested Weekly Flow

Day | Work
1 | Concepts 1-3; write ACID crib sheet by hand
2 | Concepts 4-5; draw at least 4 anomaly timelines
3 | Concept 6; reproduce lost update on PostgreSQL and confirm that a dirty read cannot be reproduced there (PostgreSQL treats READ UNCOMMITTED as READ COMMITTED)
4 | Concepts 7-8; walk through 2PL and SI on the same workload
5 | Concept 9 and Practice 1 (anomalies lab)
6 | Concepts 10-11; draw 2PC state machine, list failure modes
7 | Concept 12 and Practice 2 (saga design)
8 | Concept 13; work one linearizability history by hand
9 | Concepts 14-15 and Practice 3
10 | Practice 4 (katas), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread