Module 5: Distributed Systems Fundamentals: Case Studies
These case studies are about the part of engineering that unit tests rarely teach: messages arrive late, clocks disagree, leaders pause, retries multiply load, and partitions make "just keep serving" an unsafe requirement. Work through each case by drawing the failure model first, then the mechanism.
How To Use These Case Studies
- State the failure model before proposing a fix.
- Draw the timeline, partition, quorum, or message graph.
- Name the mechanism: timeout, logical clock, vector clock, quorum, consensus, fencing token, idempotency key, or backoff.
- Identify what is guaranteed and what is only best effort.
- Produce the required artifact and connect it to a capstone operation.
Case Study 1: GitHub's Network Partition And The Cost Of Split Databases
Scenario: A data-center network event separates primary and secondary database sites. Writes continue in a degraded topology. When connectivity returns, the system must reconcile database state, delay normal service, and protect data correctness before fully recovering.
Source anchor: GitHub's October 21, 2018 post-incident analysis describes a network partition between US East Coast facilities, database failover complications, and a long recovery process to preserve data integrity. See GitHub October 21 post-incident analysis.
Firecrawl result used: the GitHub postmortem as the top result for GitHub October 21 2018 network partition postmortem.
Module concepts:
- partial failure
- network partition
- split-brain risk
- failover
- recovery over raw availability
- operational runbook
Wrong Approach
"If a region is unreachable, promote the other side and keep serving all writes."
That can preserve short-term availability while creating divergent histories that are expensive or impossible to merge safely.
Better Approach
Write the failover policy around data safety:
During partition:
identify which side has write authority
fence old leaders
restrict unsafe writes if authority is ambiguous
After partition heals:
stop automatic churn
compare replication positions
decide recovery order
reconcile or rebuild replicas
communicate degraded modes
The mature move is not "always stay writable." It is knowing which operations must stop when the system cannot prove a single authority.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| keep both sides writable | high immediate availability | divergent state and reconciliation risk |
| single write authority | preserves a clearer history | some users lose write availability |
| read-only degraded mode | protects data while serving partial value | product capability is reduced |
| manual recovery gates | avoids automated damage | slower recovery and operational load |
Failure Mode
The system recovers network connectivity but not logical consistency. The outage continues because data histories diverged.
Required Artifact
Write a partition runbook:
Authority source:
Fence mechanism:
Writes allowed:
Reads allowed:
Replication-position check:
Recovery order:
User-visible mode:
Abort condition:
Project / Capstone Connection
Any capstone with replicas, regions, or failover should include a "who may write during partition?" answer.
Case Study 2: Raft Majority Prevents Two Leaders From Committing
Scenario: A five-node coordination cluster partitions into {N1, N2} and {N3, N4, N5}. 1 was leader before the partition. Clients can still reach 1, so the team expects 1` to continue accepting writes.
Source anchor: The Raft paper explains leader election, terms, replicated logs, and the majority rule that allows a leader to commit log entries only after replication to a majority. See In Search of an Understandable Consensus Algorithm.
Firecrawl result used: the Raft paper from raft.github.io for site:raft.github.io raft consensus paper.
Module concepts:
- consensus
- leader election
- quorum
- replicated log
- partition tolerance
- committed vs uncommitted entries
Wrong Approach
"A node that still believes it is leader can keep committing writes for its clients."
Belief is not enough. Commitment requires a majority. The minority-side leader may append local entries, but it cannot safely commit them.
Better Approach
Reason from quorum intersection:
Cluster size: 5
Majority: 3
Partition A: N1, N2
Partition B: N3, N4, N5
N1 side:
can receive client writes
cannot replicate to majority
cannot commit new entries
N3/N4/N5 side:
can elect a leader
can commit entries with majority
When the partition heals, uncommitted minority entries can be overwritten by the new leader's log.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| require majority commit | prevents split-brain commits | minority side loses write availability |
| allow minority writes | local availability | conflicting committed histories |
| increase cluster size | tolerate more failures | more coordination and operational cost |
| colocate quorum carefully | better availability model | placement complexity |
Failure Mode
The minority leader returns success for writes that later disappear, because they were never committed by a quorum.
Required Artifact
Draw a Raft partition trace:
Initial leader:
Partition sets:
Majority size:
Can old leader commit?:
Can new leader be elected?:
Which entries survive after heal?:
Client response rule:
Project / Capstone Connection
If a capstone uses a coordination system, distinguish "accepted by a node" from "committed by quorum."
Case Study 3: etcd Disaster Recovery And Quorum Math
Scenario: A Kubernetes control-plane dependency runs on a three-member etcd cluster. Two members are lost after disk corruption. An operator expects the remaining member to keep the cluster alive because one copy of the data still exists.
Source anchor: etcd disaster recovery docs state that a cluster tolerates up to (N-1)/2 permanent failures for ` members, and that recovery from a snapshot creates a new logical cluster. See etcd disaster recovery.
Firecrawl result used: the official etcd disaster recovery page for site:etcd.io docs raft leader election quorum official docs etcd disaster recovery.
Module concepts:
- quorum
- failure tolerance
- consensus availability
- snapshot recovery
- cluster identity
- disaster recovery
Wrong Approach
"One remaining node means the data still exists, so the cluster can keep operating."
Consensus systems need a quorum, not merely any surviving copy. A single node from a three-node cluster cannot prove it is the current authority after losing the other two.
Better Approach
Plan recovery around quorum and snapshots:
Normal 3-member cluster:
tolerates 1 permanent failure
requires 2 members for quorum
After losing 2 members:
no quorum
normal writes stop
restore from snapshot into a new logical cluster
verify revision/compaction expectations for clients
Backup frequency is part of the availability story. Without snapshots, "consensus protects us" is incomplete.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| 3 members | simple and common | tolerates only one permanent failure |
| 5 members | tolerates two permanent failures | more nodes and slower quorum path |
| snapshot restore | recovers from quorum loss | possible data loss since snapshot |
| unsafe single-node continuation | apparent availability | violates consensus authority assumptions |
Failure Mode
The operator turns disaster recovery into a split-brain event by forcing a lone member to act as the old cluster.
Required Artifact
Write a quorum/recovery worksheet:
Cluster size:
Quorum size:
Permanent failures tolerated:
Snapshot interval:
Maximum data loss:
Restore procedure:
Client reconfiguration:
Validation checks:
Project / Capstone Connection
If a capstone relies on etcd, ZooKeeper, Consul, or a Raft service, include quorum math and backup/restore evidence.
Case Study 4: Retry Storm After A Partial Outage
Scenario: A payment API slows down because a downstream risk service is overloaded. Clients time out at 500 ms and retry immediately. The retry traffic doubles load, causes more timeouts, and spreads the incident to healthy services.
Source anchor: Amazon Builders' Library explains that retries are selfish, can amplify overload, and need timeouts, capped retries, exponential backoff, jitter, and idempotency for side-effecting APIs. See Timeouts, retries, and backoff with jitter and Making retries safe with idempotent APIs.
Firecrawl result used: both AWS Builders' Library articles for targeted one-result searches on retries/backoff and idempotent APIs.
Module concepts:
- partial failure
- timeout
- retry amplification
- exponential backoff
- jitter
- idempotency
- overload control
Wrong Approach
"If a call times out, retry immediately until it succeeds."
Timeout does not prove the operation failed. It proves the caller stopped waiting. Immediate retry can duplicate side effects and increase load on the already-slow dependency.
Better Approach
Design retry as a load-control mechanism:
Timeout:
based on downstream latency distribution and caller budget
Retry:
bounded attempts
exponential backoff
jitter
idempotency key for side effects
Failure:
degrade or fail closed after budget is exhausted
Retry policy belongs in the API contract and client library, not as scattered loops around HTTP calls.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no retries | avoids amplification | transient faults become user failures |
| immediate retries | quick recovery for rare drops | overload and duplicate effects |
| backoff with jitter | smoother load during incidents | higher tail latency |
| idempotency keys | safe side-effect retry | requires storage and request identity rules |
Failure Mode
The retry policy turns a small downstream slowdown into a system-wide outage.
Required Artifact
Write a retry budget:
Operation:
Side effect?:
Timeout:
Max attempts:
Backoff:
Jitter:
Idempotency key:
Circuit/degrade rule:
Metrics:
Project / Capstone Connection
Every capstone API that calls another service should include a timeout/retry/idempotency contract.
Case Study 5: Wall-Clock Timestamp Used As Causal Truth
Scenario: A collaborative editor accepts updates from multiple regions. The conflict resolver picks the update with the largest updated_at timestamp. One region's clock jumps forward after time synchronization. Its stale edit wins over a causally later edit from another region.
Source anchor: Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" defines happens-before and logical clocks for reasoning about event order without relying on synchronized physical clocks. AWS EC2 documentation describes time synchronization services, but clock synchronization still provides a time reference, not a causal-ordering proof. See Lamport time/clocks paper and AWS EC2 time synchronization.
Firecrawl result used: Lamport's paper and AWS EC2 time-sync docs from targeted searches.
Module concepts:
- wall-clock time
- clock skew
- happens-before
- Lamport clocks
- vector clocks
- concurrent updates
Wrong Approach
"NTP keeps clocks synced, so timestamps tell us which edit happened last."
Physical clocks can move, skew, and disagree. Even accurate clocks do not encode message causality unless the system explicitly models it.
Better Approach
Choose the right ordering tool:
Need total display order:
Lamport clock + tie-breaker may be enough
Need detect concurrent edits:
vector clock or version vector
Need real-time external order:
use a database/system that explicitly provides it
Need merge user edits:
CRDT/OT or domain-specific merge rule
Timestamp ordering can be acceptable for approximate UI sorting. It is unsafe as a correctness mechanism for causality.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| wall-clock last-write-wins | simple | loses causality and can discard valid edits |
| Lamport clock | captures happens-before order | cannot detect all concurrency by itself |
| vector clock | detects concurrent histories | metadata grows with participants |
| CRDT/OT merge | preserves concurrent work | more complex data model |
Failure Mode
The system silently overwrites a user's newer work because "newer timestamp" was confused with "causally newer."
Required Artifact
Draw a three-node event trace:
Node A events:
Node B events:
Messages:
Wall-clock timestamps:
Lamport timestamps:
Vector timestamps:
Concurrent pairs:
Conflict rule:
Project / Capstone Connection
Any capstone with offline edits, multi-region writes, chat, document collaboration, or event ordering needs a real ordering model.
Case Study 6: CAP During A Real Partition, Not As A Slogan
Scenario: A profile service is replicated across two regions. During a network partition, both regions can still serve users. Product asks engineering to keep profile updates available everywhere and still guarantee that every read sees the latest write.
Source anchor: Gilbert and Lynch's CAP proof formalizes the tension between consistency, availability, and partition tolerance in an asynchronous network model. The practical lesson is not "pick two forever"; it is that under a partition, a system that must preserve strong consistency may have to reject or delay some operations. See Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.
Firecrawl result used: the Gilbert-Lynch CAP paper for CAP theorem Brewer Gilbert Lynch paper PDF consistency availability partition tolerance.
Module concepts:
- CAP
- partition tolerance
- availability
- linearizability
- degraded mode
- consistency contract
Wrong Approach
"CAP says we picked AP, so stale or conflicting updates are fine everywhere."
That is too blunt. The correct design can be per operation. Some writes may be unavailable under partition while other reads or low-risk updates continue.
Better Approach
Classify operations:
Strongly consistent:
password changes
permission changes
billing profile
Available/degraded:
display name
avatar
preferences
Conflict-resolved:
bio text
notification settings
For each operation, define partition behavior and user-facing state. Do not use "eventual consistency" as a blanket excuse for unsafe writes.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| reject sensitive writes during partition | preserves strong invariant | reduced availability |
| accept all writes in both regions | high local availability | conflict resolution and stale reads |
| per-operation consistency policy | matches business risk | more design and implementation work |
| read-only degraded mode | simple safety posture | users cannot complete some workflows |
Failure Mode
The service stays "available" by accepting conflicting permission or billing updates that require manual cleanup.
Required Artifact
Write a consistency matrix:
Operation:
Partition behavior:
Read freshness:
Conflict policy:
User-facing message:
Reconciliation path:
Metric/alert:
Project / Capstone Connection
If a capstone mentions CAP, include an operation-level matrix instead of a one-line CP/AP label.
Case Study 7: Stale Read In A Consensus-Backed Store
Scenario: A team uses etcd because it is "Raft-backed." They assume every read is automatically linearizable. A client performs a read from a member that is partitioned from the leader or is serving through a path that does not verify current leadership, and receives stale data.
Source anchor: Jepsen's etcd 3.4.3 analysis describes etcd as a Raft-based distributed database and examines stale-read behavior and consistency under faults. Use it as a reminder that real client APIs and read modes matter even when the core replication algorithm is strong. See Jepsen: etcd 3.4.3.
Firecrawl result used: the Jepsen etcd analysis for site:jepsen.io analyses etcd 3.4 raft consensus partitions stale reads.
Module concepts:
- linearizable read
- stale read
- Raft
- leader lease
- client API contract
- verification under fault
Wrong Approach
"Raft system means every operation is automatically linearizable."
Consensus is a mechanism inside the system. The client-visible guarantee depends on the API path, read mode, leader checks, leases, and implementation.
Better Approach
Read the client contract and test the fault mode:
For every read path:
is it served by leader?
does it go through quorum/ReadIndex?
does it rely on a lease?
what happens during partition or clock pause?
Then write tests that inject partitions and pauses while checking the exact history property the application needs.
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| local/stale read | low latency | may violate freshness assumptions |
| leader linearizable read | strong semantics | more coordination and latency |
| quorum read path | avoids stale followers | higher read cost |
| Jepsen-style fault testing | validates actual behavior | test complexity and time |
Failure Mode
The architecture is correct on paper, but the application uses a weaker API path than the design assumes.
Required Artifact
Write a read-contract test plan:
Read API:
Expected consistency:
Leader/follower behavior:
Partition injected:
Pause injected:
History checker:
Acceptable stale window:
Failure evidence:
Project / Capstone Connection
If your capstone uses a consensus-backed store for locks, config, or leader election, test the exact read/write API semantics.
Source Map
| Source | Use it for |
|---|---|
| GitHub October 21 post-incident analysis | network partition, failover, database recovery, operational tradeoffs |
| Raft paper | leader election, replicated logs, quorum commit, uncommitted entries |
| etcd disaster recovery | quorum failure tolerance and snapshot-based recovery |
| AWS timeouts, retries, and backoff with jitter | retry amplification, backoff, jitter, timeout design |
| AWS idempotent APIs | safe retries for side-effecting APIs |
| Lamport time/clocks paper | happens-before, logical clocks, event ordering |
| AWS EC2 time synchronization | physical time synchronization as a time reference, not causal proof |
| Gilbert-Lynch CAP paper | consistency, availability, partition tolerance tradeoff under partition |
| Jepsen: etcd 3.4.3 | consensus-backed store behavior under faults and stale-read analysis |
Completion Standard
- At least three case-study artifacts are completed.
- At least one artifact includes a partition or quorum diagram.
- At least one artifact includes a timeout/retry/idempotency budget.
- At least one artifact includes a logical-clock or vector-clock trace.
- At least one artifact includes an operation-level CAP/consistency matrix.
- At least one artifact tests the actual API semantics of a consensus-backed service.
- At least one case connects back to Module 3 replication/partitioning and Module 4 transactions/consistency.