Skip to main content

Module 5: Distributed Systems Fundamentals

Primary text: Distributed Systems: Concepts and Design (Coulouris et al., 5th ed.) chapters 1-2, 14-15, 18 Selective support: Designing Data-Intensive Applications (Kleppmann) chapters 8 and 9, Database Internals (Petrov) Part II chapters 8-14 for consensus, failure detection, and gossip; Ongaro's Raft paper for one canonical walkthrough

This module is where you stop treating the network as reliable, stop treating time as a shared truth, and start reasoning about partial failure as the defining property of distributed systems. You already know replication (Module 3), transactions, and consistency (Module 4). Here you learn the mechanisms that make those ideas safe under asynchrony and crashes: logical time, failure detection, and consensus.


Scope of This Module

This module is not "a catalog of distributed algorithms." It is where you build the vocabulary and reasoning habits to argue about correctness in a system where any message can be lost, any process can pause, and no clock can be fully trusted.

What it covers in depth:

  • the eight fallacies of distributed computing and why each is a bug source
  • partial failure as the single defining property of a distributed system
  • asynchrony and why a non-responding node is indistinguishable from a dead one
  • physical time: NTP, clock drift, and why wall-clock timestamps rarely order events
  • Lamport clocks and the happens-before relation as a causal ordering
  • vector clocks as the richer structure that can detect concurrent events
  • failure detection: heartbeats, timeouts, phi-accrual, and completeness vs accuracy
  • gossip protocols and SWIM-style membership under scale
  • Byzantine failures and where the trust model forces a different algorithm class
  • why consensus is needed: leader election, atomic commit, and replicated state machines
  • Paxos as the canonical single-decree consensus algorithm
  • Raft as a teachable, implementable consensus algorithm with explicit leader
  • leader election, split-brain, and fencing tokens
  • idempotency, retries, and exactly-once as an end-to-end property
  • coordination services: ZooKeeper, etcd, and Consul as shared dependencies for the above

What it deliberately does not try to finish here:

  • full formal proofs of FLP, Paxos, or Raft safety
  • peer-to-peer overlay networks (DHTs, Chord, Kademlia) beyond a sentence
  • blockchain and BFT consensus variants beyond a pointer
  • stream processing or event sourcing (later module)

If you leave this module still saying "the network is probably fine" or "we'll just timestamp the message," the module is not complete.


Before You Start

Answer these closed-book before starting the main path:

  1. You send an RPC and receive no reply for 500ms. Name at least three distinct things that could have happened.
  2. Two clients each stamp their update with their local wall-clock time. Why can the timestamps disagree about which came first even if both clocks are "correct" to the second?
  3. In a 5-node cluster with one leader, the leader is alive but partitioned from three followers. What is split-brain and why does it matter?
  4. Why is "at most once" trivial, "at least once" easy, and "exactly once" the hard one?
  5. If three services must either all commit or all abort a workflow, and the coordinator crashes halfway, why might the participants be stuck?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path and can spend less time on Cluster 1.

2-3 solid answers

  • Continue, but expect extra time in Cluster 2 (time and ordering) and Cluster 4 (consensus).

0-1 solid answers

  • Revisit Module 4 (transactions and consistency) briefly. The consensus and ordering work here only makes sense on top of a concrete mental model of what a replicated transaction is trying to guarantee.

What This Module Is For

Distributed systems are where single-machine engineering intuitions quietly break. Throughout the rest of the program you will repeatedly be asked:

  • my service replied 200 OK but the caller timed out - did the side effect happen?
  • my cluster has two nodes each claiming to be the leader - what did I configure wrong?
  • my "eventually consistent" store returned stale data after a retry - is that a bug or the contract?
  • my cross-service workflow partially failed - how do I resume without double-charging?
  • my library says it uses Raft - what am I actually getting and what am I giving up?

This module builds the reasoning needed for:

  • architecture decisions in Semester 7 (microservices, event-driven systems)
  • cloud and container orchestration topics where coordination services are background infrastructure
  • production on-call work where half the incidents are partial-failure stories

You are learning to stop waving your hands about "the network will figure it out."


Concept Map


How To Use This Module

Work in order. Later clusters only make sense if the earlier failure model is stable.

Cluster 1: The Inescapable Reality

OrderConceptTypeFocus
1The Eight Fallacies of Distributed ComputingPRIMARYDeutsch and Gosling's eight assumptions and the class of bugs each creates
2Partial Failure: The Single Defining PropertyPRIMARYWhy "some but not all" failure is the distinguishing feature of a distributed system
3Asynchrony: Slow vs Dead Is UndecidablePRIMARYWhy no timeout can safely tell you a process is dead versus merely paused

Cluster mastery check: Given a symptom ("RPC returned no answer for 2 seconds"), can you list at least five distinct failure scenarios that produce that symptom, and explain why no local observation can distinguish them?

Cluster 2: Time, Clocks, and Ordering

OrderConceptTypeFocus
4Physical vs Logical Clocks, and the Limits of NTPPRIMARYWall-clock time, monotonic time, NTP skew and drift, and why timestamps do not order events
5Lamport Clocks and Happens-BeforePRIMARYThe happens-before relation, the Lamport clock update rule, and total order from partial order
6Vector Clocks and Causal OrderingPRIMARYPer-node counters, the comparison rule, concurrent events, and why vectors can detect what Lamport cannot

Cluster mastery check: Given a three-process message trace, can you compute Lamport timestamps and vector timestamps by hand and identify which event pairs are concurrent versus causally ordered?

Cluster 3: Failure Detection and Membership

OrderConceptTypeFocus
7Heartbeats, Timeouts, and Phi-Accrual DetectorsPRIMARYCompleteness vs accuracy, fixed timeouts vs adaptive detectors, and the suspicion level
8Gossip Protocols and SWIM MembershipPRIMARYEpidemic dissemination, SWIM's direct-and-indirect probe design, and why gossip scales where broadcast does not
9Byzantine Failures and the Trust ModelSUPPORTINGCrash-stop vs omission vs Byzantine, the 3f+1 bound, and when PBFT is the right frame

Cluster mastery check: Given a production fault ("node X is marked dead by some peers but alive by others"), can you pick the right detector, adjust completeness vs accuracy, and explain why a 500ms fixed timeout is the wrong answer at cloud latencies?

Cluster 4: Consensus

OrderConceptTypeFocus
10Why Consensus Is Needed: Leader Election, Atomic Commit, Replicated State MachinesPRIMARYThe three canonical problems all reduce to consensus, plus the FLP impossibility result
11Paxos: The Canonical Single-Decree AlgorithmPRIMARYProposers, acceptors, learners, the two-phase prepare/accept protocol, and what Multi-Paxos adds
12Raft: Understandable Consensus with an Explicit LeaderPRIMARYTerms, leader election, log replication RPCs, and the safety invariants

Cluster mastery check: Given a 5-node Raft cluster and a sequence of leader crashes and partitions, can you trace which logs are committed, which are overwritten, and which node becomes the next leader?

Cluster 5: Distributed System Patterns

OrderConceptTypeFocus
13Leader Election and Split-Brain PreventionPRIMARYQuorum, fencing tokens, lease-based leadership, and why "one leader at a time" is never free
14Idempotency, Exactly-Once, and Retry SemanticsPRIMARYAt-most-once vs at-least-once vs effectively-once, idempotency keys, and the end-to-end argument
15Coordination Services: ZooKeeper, etcd, ConsulSUPPORTINGWhat coordination services actually give you (leader election, distributed locks, config, service discovery) and why you almost never write consensus yourself

Cluster mastery check: For a multi-instance worker service that must have exactly one active scheduler, can you choose between ZooKeeper, etcd, and Consul and write the leader-election pseudo-code that uses the service correctly (with fencing)?

Then work these practice pages:

OrderPractice pathFocus
1Time and Ordering LabSimulate Lamport and vector clocks on a small system and classify concurrent vs causal events
2Failure Model WorkshopMap real fault scenarios onto crash-stop, omission, Byzantine, and timing models
3Consensus Reasoning ClinicTrace Paxos and Raft runs by hand; reason about split-brain and fencing
4Distributed Systems Code KatasSimulate a toy Raft election, design an idempotent HTTP API, analyze a real postmortem

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Name all eight fallacies of distributed computing and give one real incident or code pattern that exhibits each.
  2. Define partial failure precisely and explain why it makes local reasoning insufficient.
  3. Argue why no timeout can reliably distinguish a dead process from a slow one, and connect this to the FLP impossibility result.
  4. Explain the difference between physical wall-clock time, monotonic time, and logical time, and describe at least two real bugs caused by relying on wall-clock ordering.
  5. Compute Lamport timestamps for a three-process message trace by hand, and state the guarantee and the limitation of Lamport clocks.
  6. Compute vector timestamps for the same trace and use them to detect concurrent events, and explain when vector clocks are worth the extra metadata.
  7. Describe how heartbeats, fixed timeouts, and phi-accrual detectors trade completeness against accuracy, and tune a detector for a given workload.
  8. Describe how gossip and SWIM disseminate membership information in O(log n) rounds and explain why this scales where broadcast does not.
  9. Distinguish crash-stop, omission, and Byzantine failure models and justify when each applies in practice.
  10. Explain why leader election, atomic commit, and replicated state machines all reduce to consensus, and state the FLP impossibility result in plain English.
  11. Walk through the Paxos prepare/accept phases on a small example, including a case where a proposer's value is overridden.
  12. Walk through a Raft leader election and log replication RPC, and name the safety invariants that protect committed entries.
  13. Design leader election with split-brain prevention using fencing tokens, and explain why a stale leader's writes must be rejected.
  14. Distinguish at-most-once, at-least-once, and effectively-once semantics, and design an idempotent HTTP API using an idempotency key.
  15. Explain what a coordination service (ZooKeeper, etcd, Consul) actually provides, and pick one appropriately for a given problem without reimplementing consensus.

Outputs

  • a written eight-fallacies checklist: for each fallacy, one production incident or code pattern that would exhibit it
  • a message-trace diagram annotated with Lamport timestamps and vector timestamps, and a labeled list of which event pairs are concurrent
  • a phi-accrual tuning note: you pick a workload, reason about detection time vs false-positive rate, and pick thresholds with justification
  • a Raft trace on paper: 5 nodes, one leader crash, one network partition, one re-election, with terms and logs drawn at every step
  • a 2-page idempotent API design memo: request flow, idempotency key lifecycle, retry semantics, and the end-to-end argument applied
  • a postmortem analysis of a real outage (Cloudflare, AWS S3, GitHub, GitLab, etc.) labeled with the distributed-systems concepts it exhibits
  • a leader-election pseudo-code snippet using etcd or ZooKeeper primitives with explicit fencing
  • a mistake log (at least 8 entries) on fallacies you violated, clock confusions, split-brain scenarios, and exactly-once myths

Completion Standard

You have completed Module 5 when all of these are true:

  • you can enumerate the eight fallacies without looking them up
  • you can argue why a timeout is a guess and connect this to FLP
  • you can compute Lamport and vector timestamps by hand on a three-process trace
  • you have written Raft's leader-election and log-replication rules from memory at a level of detail that a peer could implement
  • you have traced at least one Paxos run and at least one Raft run including a re-election
  • you can design an idempotent API with an idempotency key and justify it in end-to-end terms
  • you have analyzed a real outage and labeled the distributed-systems concepts it exhibits
  • you no longer say "exactly-once delivery" as if it were free

If you are still writing "we'll just retry on timeout" without a correctness argument, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • Read only if stuck means try the concept page, self-check, and drill first.
  • External validated links (raft.github.io, the Raft paper, Lamport's site, Aphyr's Jepsen posts) are targeted; read them when the concept page points to them.
  • Because this module underpins every architecture and operations topic after Semester 6, hand-drawn clock diagrams and hand-traced Raft runs are required, not optional.

Suggested Weekly Flow

DayWork
1Concepts 1-3; write the eight-fallacies incident checklist
2Concepts 4-6; draw a 3-process message trace with Lamport and vector timestamps
3Concepts 7-9; tune a phi-accrual detector on paper for one workload
4Concept 10 and Practice 1 (time and ordering lab)
5Concept 11; trace one Paxos run by hand
6Concept 12; trace one Raft election and a log-replication RPC
7Concepts 13-15 and Practice 2 (failure model workshop)
8Practice 3 (consensus reasoning clinic)
9Practice 4 (code katas), including postmortem analysis
10Quiz and mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X — elective

The Consensus / Raft tutorial is the most challenging distributed-systems project at this level — and the one that pays back the most. Pair with the Blockchain tutorial for a Nakamoto-consensus contrast (different family, same problem). See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread