Learning Resources
This module is populated from the local chunked books in library/raw/semester-06-databases-distributed/books. Use this page as a source map, not as an instruction to read everything.
Source Stack
| Book | Role | How to use it in this module |
|---|---|---|
| Designing Data-Intensive Applications (Kleppmann) | Primary teaching source | Default escalation for every primary concept. Chapters 8 (The Trouble with Distributed Systems) and 9 (Consistency and Consensus) are the core chunks for this module |
| Distributed Systems Concepts and Design (Coulouris et al.) | Canonical theory source | Formal treatment of time, clocks, failure models, and consensus. Go here for the rigorous textbook view |
| Database Internals (Petrov) | Implementation-level support | Concrete views of failure detection, gossip, ZAB, Paxos, Multi-Paxos, and Raft; most efficient for "how it is actually built" |
| Database System Concepts (Silberschatz et al.) | Peripheral | Limited direct coverage of these topics; used lightly |
Resource Map by Cluster
Cluster 1: The Inescapable Reality
| Need | Best local chunk | Why |
|---|---|---|
| Fallacies (origin and modern restatement) | Database Internals: Fallacies of Distributed Computing | Cleanest single-chunk list and commentary |
| Partial failure intuition | DDIA: Faults and Partial Failures | The single best page on "a distributed system has partial failure" |
| Networks are unreliable | DDIA: Unreliable Networks | Concrete patterns for how networks really fail |
| Cloud vs HPC failure culture | DDIA: Cloud Computing and Supercomputing | Where partial failure is a design choice |
| Timeouts and unbounded delay | DDIA: Timeouts and Unbounded Delays | Why no timeout value can reliably distinguish a slow node from a dead one |
| Async vs sync networks | DDIA: Synchronous Versus Asynchronous Networks | Formal contrast; partial synchrony motivation |
| Process pauses | DDIA: Process Pauses (Part 1), Part 2 | GC pauses as a distributed failure mode |
| Two generals | Database Internals: Two Generals' Problem | Impossibility motivator |
| System synchrony formalism | Database Internals: System Synchrony | Async vs partial sync vs sync formalized |
| Coulouris challenges | Coulouris: Challenges (Part 1), Part 2, Part 3 | Textbook framing of heterogeneity, failure, scaling |
Cluster 2: Time, Clocks, and Ordering
| Need | Best local chunk | Why |
|---|---|---|
| Why clocks are unreliable | DDIA: Unreliable Clocks | The single best page on this topic |
| NTP accuracy in practice | DDIA: Clock Synchronization and Accuracy | Concrete numbers and failure modes |
| Wall-clock pitfalls | DDIA: Relying on Synchronized Clocks (Part 1), Part 2 | Last-write-wins (LWW) traps and Spanner's TrueTime |
| Implementation-level time | Database Internals: Clocks and Time | Monotonic vs wall-clock at the system level |
| Physical clock synchronization | Coulouris: Synchronizing Physical Clocks (Part 1) | Cristian's algorithm, Berkeley, NTP |
| Logical time (textbook) | Coulouris: Logical Time and Logical Clocks | Lamport and vector clocks rigorously |
| Happens-before and causality | DDIA: Ordering and Causality (Part 1), Part 2 | Happens-before to causal consistency |
| Sequence number ordering | DDIA: Sequence Number Ordering (Part 1) | From Lamport clocks to leader-issued sequence numbers |
| Ordering (implementation) | Database Internals: Ordering | Compact implementation-level summary |
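
For orientation before opening the logical-time chunks, here is a minimal sketch of a Lamport clock. The class and method names are hypothetical and not taken from any of the books. It only illustrates the usual rule: increment on local events and sends, and on receipt take the maximum of the local and message timestamps plus one.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative; names are hypothetical)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is itself an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, move past both the local counter and the message timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()          # a is now at 1
t = a.send()      # a is now at 2; the message carries timestamp 2
b.tick()          # b is now at 1
r = b.receive(t)  # b jumps to max(1, 2) + 1 = 3
```

The receive timestamp is always greater than the send timestamp, so the counter is consistent with happens-before. The converse does not hold: a smaller timestamp does not prove causal precedence, which is the gap vector clocks close.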
Cluster 3: Failure Detection and Membership
| Need | Best local chunk | Why |
|---|---|---|
| Failure detection overview | Database Internals: Chapter 9 - Failure Detection | Best introduction to the problem |
| Phi-accrual | Database Internals: Phi-Accrual Failure Detector | The core technique Cassandra and others use |
| Failure detection summary | Database Internals: Summary (Chapter 9) | Compact wrap-up of detection protocols |
| Gossip dissemination | Database Internals: Gossip Dissemination | Logarithmic dissemination, SWIM-adjacent design |
| Hybrid gossip | Database Internals: Hybrid Gossip | Performance tuning real gossip |
| Anti-entropy primer | Database Internals: Chapter 12 - Anti-entropy and Dissemination | Context for gossip |
| Gossip (textbook) | Coulouris: Gossip architecture (Part 1) | Ladin et al.'s gossip architecture for highly available replication |
| Omission faults | Database Internals: Omission Faults | The failure model under gossip/heartbeat |
| Byzantine (DDIA) | DDIA: Byzantine Faults | When and why BFT applies |
| System model and reality | DDIA: System Model and Reality | Connecting textbook models to real hardware |
| PBFT algorithm | Database Internals: PBFT Algorithm | Classical BFT reference |
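
As a rough companion to the phi-accrual row above, the following sketch computes a suspicion level from heartbeat arrival times. It assumes exponentially distributed heartbeat intervals, a simplification chosen for brevity; the class name, window size, and timestamps are all hypothetical.

```python
import math
from collections import deque
from typing import Optional


class PhiAccrualDetector:
    """Simplified accrual detector: returns a suspicion level, not a yes/no verdict."""

    def __init__(self, window: int = 100) -> None:
        self.intervals = deque(maxlen=window)  # recent inter-arrival times
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat and the gap since the previous one."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(next heartbeat arrives later than the elapsed time))."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Assumption: exponential model, so P(later) = exp(-elapsed / mean).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):
    detector.heartbeat(t)

phi_soon = detector.phi(3.5)   # half an interval of silence: low suspicion
phi_late = detector.phi(13.0)  # ten intervals of silence: high suspicion
```

The caller chooses the threshold at which a node is treated as failed, which is the point of an accrual detector: the detector reports evidence and the application decides how much risk of a false positive it will accept.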
Cluster 4: Consensus
| Need | Best local chunk | Why |
|---|---|---|
| Why consensus is needed | DDIA: Distributed Transactions and Consensus | The problem statement across multiple use cases |
| Fault-tolerant consensus | DDIA: Fault-Tolerant Consensus (Part 1), Part 2, Part 3 | Kleppmann's modern presentation of the consensus problem |
| Consensus chapter | Database Internals: Chapter 14 - Consensus | Implementation-level entry point |
| Paxos | Database Internals: Paxos | Clear basic Paxos |
| Paxos quorums | Database Internals: Quorums in Paxos | Quorum intersection explained |
| Multi-Paxos | Database Internals: Multi-Paxos | From single-value to replicated log |
| Egalitarian Paxos | Database Internals: Egalitarian Paxos | Where the leaderless variant fits |
| Raft | Database Internals: Raft | Core Raft exposition |
| Raft leader role | Database Internals: Leader Role in Raft | Operational model of the leader |
| ZAB | Database Internals: ZAB | ZooKeeper's atomic broadcast protocol |
| Consensus (textbook) | Coulouris: Consensus and related problems (Part 1), Part 2, Part 3 | Formal problem definition, FLP, and Byzantine generals |
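
The quorum-intersection property behind the Paxos and Raft chunks can be checked by brute force on a small cluster. This sketch uses hypothetical node names and enumerates every majority of a five-node cluster to confirm that any two of them share at least one node.

```python
from itertools import combinations


def majorities(nodes):
    """Return every subset of nodes that contains a strict majority."""
    need = len(nodes) // 2 + 1
    return [
        set(q)
        for size in range(need, len(nodes) + 1)
        for q in combinations(nodes, size)
    ]


nodes = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical cluster
quorums = majorities(nodes)

# Any two majorities overlap, so a later quorum always contains at least one
# node that took part in an earlier decision.
all_intersect = all(q1 & q2 for q1, q2 in combinations(quorums, 2))

# Counterexample: groups of two out of five can be disjoint.
pairs = [set(q) for q in combinations(nodes, 2)]
some_disjoint = any(not (q1 & q2) for q1, q2 in combinations(pairs, 2))
```

The overlap is what lets a new leader or proposer learn about previously accepted values without contacting every node.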
Cluster 5: Distributed System Patterns
| Need | Best local chunk | Why |
|---|---|---|
| Leader election introduction | Database Internals: Chapter 10 - Leader Election | Compact introduction to the problem |
| Bully algorithm | Database Internals: Bully Algorithm | Classical leader-election reference |
| Majority as truth | DDIA: The Truth Is Defined by the Majority | Why single-leader designs must check quorums |
| Elections (Coulouris) | Coulouris: Elections (Part 1) | Formal election algorithm treatment |
| Idempotency context | DDIA: Summary (Chapter 8 Part 2) | Wrap-up of the retry-and-duplicate story |
| End-to-end argument | DDIA: The End-to-End Argument for Databases (Part 1), Part 2 | Why exactly-once must live at the application layer |
| Coordination avoidance | Database Internals: Coordination Avoidance | When you can skip the coordination dance entirely |
| Coordination services (DDIA) | DDIA: Membership and Coordination Services | What ZooKeeper/etcd/Consul actually provide |
| Coordination services (Coulouris) | Coulouris: Data storage and coordination services (Part 1), Part 2 | Google Chubby-style treatment |
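
To make the retry-and-duplicate story in the idempotency and end-to-end rows concrete, here is a minimal sketch of application-level deduplication by request ID. The `Account` class and the request ID format are hypothetical, and a real system would have to persist the set of applied IDs atomically with the balance.

```python
class Account:
    """Toy account that applies each request ID at most once."""

    def __init__(self, balance: int = 0) -> None:
        self.balance = balance
        self.applied = set()  # request IDs that have already taken effect

    def deposit(self, request_id: str, amount: int) -> bool:
        """Apply the deposit unless this request ID was seen before."""
        if request_id in self.applied:
            return False  # duplicate delivery or client retry: no effect
        self.balance += amount
        self.applied.add(request_id)
        return True


acct = Account()
first = acct.deposit("req-42", 100)  # applied
retry = acct.deposit("req-42", 100)  # same ID after a timeout: ignored
```

Because the client cannot tell a lost request from a lost acknowledgement, it must retry; the request ID is what makes the retry safe.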
Exercise Support Chunks
Use these when concept pages are understood but fluency is weak:
- DDIA: Summary (Chapter 8 Part 1)
- DDIA: Summary (Chapter 8 Part 2)
- DDIA: Summary (Chapter 9 Part 1)
- DDIA: Summary (Chapter 9 Part 2)
- Database Internals: Consensus Summary
External Resources (Validated, Read If Pointed Here)
The module links to specific external posts from concept pages. All were validated as of the most recent curation pass.
- Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System" (1978) - the foundational paper for logical time. Still the single clearest text on the happens-before relation.
- The Raft Consensus Algorithm (raft.github.io) - the canonical index of Raft resources, including the paper, visualization, and implementation links.
- Ongaro and Ousterhout: "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014) - the Raft paper.
- Diego Ongaro's Raft PhD dissertation (Stanford, 2014) - deeper treatment including log compaction, membership changes, and operational concerns.
- Lamport: "Paxos Made Simple" (2001) - Lamport's own restatement of Paxos; still terse but the canonical source.
- Chandra, Griesemer, and Redstone: "Paxos Made Live: An Engineering Perspective" (Google, PODC 2007) - production experience running Paxos (Chubby).
- Peter Bailis and Kyle Kingsbury: "The Network is Reliable" (ACM Queue, 2014) - empirical case studies of real-world partition incidents. The title is ironic: the article is a concrete counterpoint to the fallacy that the network is reliable.
- Jepsen - independent distributed-system correctness analyses. The single best external source for "what goes wrong in practice."
- Martin Kleppmann: "How to do distributed locking" (2016) - the canonical Redlock critique and the fencing-token argument.
- Ongaro: Raft visualization - interactive simulator. Worth one hour if elections feel abstract.
- Werner Vogels: "Eventually Consistent" (ACM Queue, 2008) - classical framing of the consistency trade-off.
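- L Peter Deutsch et al.: The Eight Fallacies of Distributed Computing - the original list, with commentary. Deutsch is usually credited with the list as a whole; the eighth fallacy was added later by James Gosling.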
Use Rules
- For every primary concept, the local book chunk is the escalation. Reach for DDIA (chapters 8 and 9) first; Coulouris for the formal treatment; Database Internals for the implementation view.
- Open one chunk per gap. Do not drift into whole chapters.
- External links are targeted: when a concept page says "external," it means the local chunks do not cover that angle well enough.