Module 5: Distributed Systems Fundamentals

Primary text: Distributed Systems: Concepts and Design (Coulouris et al., 5th ed.) chapters 1-2, 14-15, 18 Selective support: Designing Data-Intensive Applications (Kleppmann) chapters 8 and 9, Database Internals (Petrov) Part II chapters 8-14 for consensus, failure detection, and gossip; Ongaro's Raft paper for one canonical walkthrough

This module is where you stop treating the network as reliable, stop treating time as a shared truth, and start reasoning about partial failure as the defining property of distributed systems. You already know replication (Module 3), transactions, and consistency (Module 4). Here you learn the mechanisms that make those ideas safe under asynchrony and crashes: logical time, failure detection, and consensus.

Scope of This Module

This module is not "a catalog of distributed algorithms." It is where you build the vocabulary and reasoning habits to argue about correctness in a system where any message can be lost, any process can pause, and no clock can be fully trusted.

What it covers in depth:

the eight fallacies of distributed computing and why each is a bug source
partial failure as the single defining property of a distributed system
asynchrony and why a non-responding node is indistinguishable from a dead one
physical time: NTP, clock drift, and why wall-clock timestamps rarely order events
Lamport clocks and the happens-before relation as a causal ordering
vector clocks as the richer structure that can detect concurrent events
failure detection: heartbeats, timeouts, phi-accrual, and completeness vs accuracy
gossip protocols and SWIM-style membership under scale
Byzantine failures and where the trust model forces a different algorithm class
why consensus is needed: leader election, atomic commit, and replicated state machines
Paxos as the canonical single-decree consensus algorithm
Raft as a teachable, implementable consensus algorithm with explicit leader
leader election, split-brain, and fencing tokens
idempotency, retries, and exactly-once as an end-to-end property
coordination services: ZooKeeper, etcd, and Consul as shared dependencies for the above

What it deliberately does not try to finish here:

full formal proofs of FLP, Paxos, or Raft safety
peer-to-peer overlay networks (DHTs, Chord, Kademlia) beyond a sentence
blockchain and BFT consensus variants beyond a pointer
stream processing or event sourcing (later module)

If you leave this module still saying "the network is probably fine" or "we'll just timestamp the message," the module is not complete.

Before You Start

Answer these closed-book before starting the main path:

You send an RPC and receive no reply for 500ms. Name at least three distinct things that could have happened.
Two clients each stamp their update with their local wall-clock time. Why can the timestamps disagree about which came first even if both clocks are "correct" to the second?
In a 5-node cluster with one leader, the leader is alive but partitioned from three followers. What is split-brain and why does it matter?
Why is "at most once" trivial, "at least once" easy, and "exactly once" the hard one?
If three services must either all commit or all abort a workflow, and the coordinator crashes halfway, why might the participants be stuck?

Diagnostic Interpretation

4-5 solid answers

You are ready for the full path and can spend less time on Cluster 1.

2-3 solid answers

Continue, but expect extra time in Cluster 2 (time and ordering) and Cluster 4 (consensus).

0-1 solid answers

Revisit Module 4 (transactions and consistency) briefly. The consensus and ordering work here only makes sense on top of a concrete mental model of what a replicated transaction is trying to guarantee.

What This Module Is For

Distributed systems are where single-machine engineering intuitions quietly break. Throughout the rest of the program you will repeatedly be asked:

my service replied 200 OK but the caller timed out - did the side effect happen?
my cluster has two nodes each claiming to be the leader - what did I configure wrong?
my "eventually consistent" store returned stale data after a retry - is that a bug or the contract?
my cross-service workflow partially failed - how do I resume without double-charging?
my library says it uses Raft - what am I actually getting and what am I giving up?

This module builds the reasoning needed for:

architecture decisions in Semester 7 (microservices, event-driven systems)
cloud and container orchestration topics where coordination services are background infrastructure
production on-call work where half the incidents are partial-failure stories

You are learning to stop waving your hands about "the network will figure it out."

Concept Map

How To Use This Module

Work in order. Later clusters only make sense if the earlier failure model is stable.

Cluster 1: The Inescapable Reality

Order	Concept	Type	Focus
1	The Eight Fallacies of Distributed Computing	PRIMARY	Deutsch and Gosling's eight assumptions and the class of bugs each creates
2	Partial Failure: The Single Defining Property	PRIMARY	Why "some but not all" failure is the distinguishing feature of a distributed system
3	Asynchrony: Slow vs Dead Is Undecidable	PRIMARY	Why no timeout can safely tell you a process is dead versus merely paused

Cluster mastery check: Given a symptom ("RPC returned no answer for 2 seconds"), can you list at least five distinct failure scenarios that produce that symptom, and explain why no local observation can distinguish them?

Cluster 2: Time, Clocks, and Ordering

Order	Concept	Type	Focus
4	Physical vs Logical Clocks, and the Limits of NTP	PRIMARY	Wall-clock time, monotonic time, NTP skew and drift, and why timestamps do not order events
5	Lamport Clocks and Happens-Before	PRIMARY	The happens-before relation, the Lamport clock update rule, and total order from partial order
6	Vector Clocks and Causal Ordering	PRIMARY	Per-node counters, the comparison rule, concurrent events, and why vectors can detect what Lamport cannot

Cluster mastery check: Given a three-process message trace, can you compute Lamport timestamps and vector timestamps by hand and identify which event pairs are concurrent versus causally ordered?

Cluster 3: Failure Detection and Membership

Order	Concept	Type	Focus
7	Heartbeats, Timeouts, and Phi-Accrual Detectors	PRIMARY	Completeness vs accuracy, fixed timeouts vs adaptive detectors, and the suspicion level
8	Gossip Protocols and SWIM Membership	PRIMARY	Epidemic dissemination, SWIM's direct-and-indirect probe design, and why gossip scales where broadcast does not
9	Byzantine Failures and the Trust Model	SUPPORTING	Crash-stop vs omission vs Byzantine, the 3f+1 bound, and when PBFT is the right frame

Cluster mastery check: Given a production fault ("node X is marked dead by some peers but alive by others"), can you pick the right detector, adjust completeness vs accuracy, and explain why a 500ms fixed timeout is the wrong answer at cloud latencies?

Cluster 4: Consensus

Order	Concept	Type	Focus
10	Why Consensus Is Needed: Leader Election, Atomic Commit, Replicated State Machines	PRIMARY	The three canonical problems all reduce to consensus, plus the FLP impossibility result
11	Paxos: The Canonical Single-Decree Algorithm	PRIMARY	Proposers, acceptors, learners, the two-phase prepare/accept protocol, and what Multi-Paxos adds
12	Raft: Understandable Consensus with an Explicit Leader	PRIMARY	Terms, leader election, log replication RPCs, and the safety invariants

Cluster mastery check: Given a 5-node Raft cluster and a sequence of leader crashes and partitions, can you trace which logs are committed, which are overwritten, and which node becomes the next leader?

Cluster 5: Distributed System Patterns

Order	Concept	Type	Focus
13	Leader Election and Split-Brain Prevention	PRIMARY	Quorum, fencing tokens, lease-based leadership, and why "one leader at a time" is never free
14	Idempotency, Exactly-Once, and Retry Semantics	PRIMARY	At-most-once vs at-least-once vs effectively-once, idempotency keys, and the end-to-end argument
15	Coordination Services: ZooKeeper, etcd, Consul	SUPPORTING	What coordination services actually give you (leader election, distributed locks, config, service discovery) and why you almost never write consensus yourself

Cluster mastery check: For a multi-instance worker service that must have exactly one active scheduler, can you choose between ZooKeeper, etcd, and Consul and write the leader-election pseudo-code that uses the service correctly (with fencing)?

Then work these practice pages:

Order	Practice path	Focus
1	Time and Ordering Lab	Simulate Lamport and vector clocks on a small system and classify concurrent vs causal events
2	Failure Model Workshop	Map real fault scenarios onto crash-stop, omission, Byzantine, and timing models
3	Consensus Reasoning Clinic	Trace Paxos and Raft runs by hand; reason about split-brain and fencing
4	Distributed Systems Code Katas	Simulate a toy Raft election, design an idempotent HTTP API, analyze a real postmortem

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Name all eight fallacies of distributed computing and give one real incident or code pattern that exhibits each.
Define partial failure precisely and explain why it makes local reasoning insufficient.
Argue why no timeout can reliably distinguish a dead process from a slow one, and connect this to the FLP impossibility result.
Explain the difference between physical wall-clock time, monotonic time, and logical time, and describe at least two real bugs caused by relying on wall-clock ordering.
Compute Lamport timestamps for a three-process message trace by hand, and state the guarantee and the limitation of Lamport clocks.
Compute vector timestamps for the same trace and use them to detect concurrent events, and explain when vector clocks are worth the extra metadata.
Describe how heartbeats, fixed timeouts, and phi-accrual detectors trade completeness against accuracy, and tune a detector for a given workload.
Describe how gossip and SWIM disseminate membership information in O(log n) rounds and explain why this scales where broadcast does not.
Distinguish crash-stop, omission, and Byzantine failure models and justify when each applies in practice.
Explain why leader election, atomic commit, and replicated state machines all reduce to consensus, and state the FLP impossibility result in plain English.
Walk through the Paxos prepare/accept phases on a small example, including a case where a proposer's value is overridden.
Walk through a Raft leader election and log replication RPC, and name the safety invariants that protect committed entries.
Design leader election with split-brain prevention using fencing tokens, and explain why a stale leader's writes must be rejected.
Distinguish at-most-once, at-least-once, and effectively-once semantics, and design an idempotent HTTP API using an idempotency key.
Explain what a coordination service (ZooKeeper, etcd, Consul) actually provides, and pick one appropriately for a given problem without reimplementing consensus.

Outputs

a written eight-fallacies checklist: for each fallacy, one production incident or code pattern that would exhibit it
a message-trace diagram annotated with Lamport timestamps and vector timestamps, and a labeled list of which event pairs are concurrent
a phi-accrual tuning note: you pick a workload, reason about detection time vs false-positive rate, and pick thresholds with justification
a Raft trace on paper: 5 nodes, one leader crash, one network partition, one re-election, with terms and logs drawn at every step
a 2-page idempotent API design memo: request flow, idempotency key lifecycle, retry semantics, and the end-to-end argument applied
a postmortem analysis of a real outage (Cloudflare, AWS S3, GitHub, GitLab, etc.) labeled with the distributed-systems concepts it exhibits
a leader-election pseudo-code snippet using etcd or ZooKeeper primitives with explicit fencing
a mistake log (at least 8 entries) on fallacies you violated, clock confusions, split-brain scenarios, and exactly-once myths

Completion Standard

You have completed Module 5 when all of these are true:

you can enumerate the eight fallacies without looking them up
you can argue why a timeout is a guess and connect this to FLP
you can compute Lamport and vector timestamps by hand on a three-process trace
you have written Raft's leader-election and log-replication rules from memory at a level of detail that a peer could implement
you have traced at least one Paxos run and at least one Raft run including a re-election
you can design an idempotent API with an idempotency key and justify it in end-to-end terms
you have analyzed a real outage and labeled the distributed-systems concepts it exhibits
you no longer say "exactly-once delivery" as if it were free

If you are still writing "we'll just retry on timeout" without a correctness argument, the module is not complete.

Reading Policy

Concept pages are the main path.
Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuck means try the concept page, self-check, and drill first.
External validated links (raft.github.io, the Raft paper, Lamport's site, Aphyr's Jepsen posts) are targeted; read them when the concept page points to them.
Because this module underpins every architecture and operations topic after Semester 6, hand-drawn clock diagrams and hand-traced Raft runs are required, not optional.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3; write the eight-fallacies incident checklist
2	Concepts 4-6; draw a 3-process message trace with Lamport and vector timestamps
3	Concepts 7-9; tune a phi-accrual detector on paper for one workload
4	Concept 10 and Practice 1 (time and ordering lab)
5	Concept 11; trace one Paxos run by hand
6	Concept 12; trace one Raft election and a log-replication RPC
7	Concepts 13-15 and Practice 2 (failure model workshop)
8	Practice 3 (consensus reasoning clinic)
9	Practice 4 (code katas), including postmortem analysis
10	Quiz and mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Build Your Own X — elective

The Consensus / Raft tutorial is the most challenging distributed-systems project at this level — and the one that pays back the most. Pair with the Blockchain tutorial for a Nakamoto-consensus contrast (different family, same problem). See Build Your Own X overview.

Rich Learning Pages

Scope of This Module​

Before You Start​

Diagnostic Interpretation​

What This Module Is For​

Concept Map​

How To Use This Module​

Cluster 1: The Inescapable Reality​

Cluster 2: Time, Clocks, and Ordering​

Cluster 3: Failure Detection and Membership​

Cluster 4: Consensus​

Cluster 5: Distributed System Patterns​

Learning Objectives​

Outputs​

Completion Standard​

Reading Policy​

Suggested Weekly Flow​

Reference​

Rich Learning Pages​