Skip to main content

Module 5: Distributed Systems Fundamentals: Case Studies

These case studies are about the part of engineering that unit tests rarely teach: messages arrive late, clocks disagree, leaders pause, retries multiply load, and partitions make "just keep serving" an unsafe requirement. Work through each case by drawing the failure model first, then the mechanism.


How To Use These Case Studies

  1. State the failure model before proposing a fix.
  2. Draw the timeline, partition, quorum, or message graph.
  3. Name the mechanism: timeout, logical clock, vector clock, quorum, consensus, fencing token, idempotency key, or backoff.
  4. Identify what is guaranteed and what is only best effort.
  5. Produce the required artifact and connect it to a capstone operation.

Case Study 1: GitHub's Network Partition And The Cost Of Split Databases

Scenario: A data-center network event separates primary and secondary database sites. Writes continue in a degraded topology. When connectivity returns, the system must reconcile database state, delay normal service, and protect data correctness before fully recovering.

Source anchor: GitHub's October 21, 2018 post-incident analysis describes a network partition between US East Coast facilities, database failover complications, and a long recovery process to preserve data integrity. See GitHub October 21 post-incident analysis.

Firecrawl result used: the GitHub postmortem as the top result for GitHub October 21 2018 network partition postmortem.

Module concepts:

  • partial failure
  • network partition
  • split-brain risk
  • failover
  • recovery over raw availability
  • operational runbook

Wrong Approach

"If a region is unreachable, promote the other side and keep serving all writes."

That can preserve short-term availability while creating divergent histories that are expensive or impossible to merge safely.

Better Approach

Write the failover policy around data safety:

During partition:
identify which side has write authority
fence old leaders
restrict unsafe writes if authority is ambiguous

After partition heals:
stop automatic churn
compare replication positions
decide recovery order
reconcile or rebuild replicas
communicate degraded modes

The mature move is not "always stay writable." It is knowing which operations must stop when the system cannot prove a single authority.

Tradeoff Table

ChoiceGainCost
keep both sides writablehigh immediate availabilitydivergent state and reconciliation risk
single write authoritypreserves a clearer historysome users lose write availability
read-only degraded modeprotects data while serving partial valueproduct capability is reduced
manual recovery gatesavoids automated damageslower recovery and operational load

Failure Mode

The system recovers network connectivity but not logical consistency. The outage continues because data histories diverged.

Required Artifact

Write a partition runbook:

Authority source:
Fence mechanism:
Writes allowed:
Reads allowed:
Replication-position check:
Recovery order:
User-visible mode:
Abort condition:

Project / Capstone Connection

Any capstone with replicas, regions, or failover should include a "who may write during partition?" answer.


Case Study 2: Raft Majority Prevents Two Leaders From Committing

Scenario: A five-node coordination cluster partitions into {N1, N2} and {N3, N4, N5}. 1 was leader before the partition. Clients can still reach 1, so the team expects 1` to continue accepting writes.

Source anchor: The Raft paper explains leader election, terms, replicated logs, and the majority rule that allows a leader to commit log entries only after replication to a majority. See In Search of an Understandable Consensus Algorithm.

Firecrawl result used: the Raft paper from raft.github.io for site:raft.github.io raft consensus paper.

Module concepts:

  • consensus
  • leader election
  • quorum
  • replicated log
  • partition tolerance
  • committed vs uncommitted entries

Wrong Approach

"A node that still believes it is leader can keep committing writes for its clients."

Belief is not enough. Commitment requires a majority. The minority-side leader may append local entries, but it cannot safely commit them.

Better Approach

Reason from quorum intersection:

Cluster size: 5
Majority: 3
Partition A: N1, N2
Partition B: N3, N4, N5

N1 side:
can receive client writes
cannot replicate to majority
cannot commit new entries

N3/N4/N5 side:
can elect a leader
can commit entries with majority

When the partition heals, uncommitted minority entries can be overwritten by the new leader's log.

Tradeoff Table

ChoiceGainCost
require majority commitprevents split-brain commitsminority side loses write availability
allow minority writeslocal availabilityconflicting committed histories
increase cluster sizetolerate more failuresmore coordination and operational cost
colocate quorum carefullybetter availability modelplacement complexity

Failure Mode

The minority leader returns success for writes that later disappear, because they were never committed by a quorum.

Required Artifact

Draw a Raft partition trace:

Initial leader:
Partition sets:
Majority size:
Can old leader commit?:
Can new leader be elected?:
Which entries survive after heal?:
Client response rule:

Project / Capstone Connection

If a capstone uses a coordination system, distinguish "accepted by a node" from "committed by quorum."


Case Study 3: etcd Disaster Recovery And Quorum Math

Scenario: A Kubernetes control-plane dependency runs on a three-member etcd cluster. Two members are lost after disk corruption. An operator expects the remaining member to keep the cluster alive because one copy of the data still exists.

Source anchor: etcd disaster recovery docs state that a cluster tolerates up to (N-1)/2 permanent failures for ` members, and that recovery from a snapshot creates a new logical cluster. See etcd disaster recovery.

Firecrawl result used: the official etcd disaster recovery page for site:etcd.io docs raft leader election quorum official docs etcd disaster recovery.

Module concepts:

  • quorum
  • failure tolerance
  • consensus availability
  • snapshot recovery
  • cluster identity
  • disaster recovery

Wrong Approach

"One remaining node means the data still exists, so the cluster can keep operating."

Consensus systems need a quorum, not merely any surviving copy. A single node from a three-node cluster cannot prove it is the current authority after losing the other two.

Better Approach

Plan recovery around quorum and snapshots:

Normal 3-member cluster:
tolerates 1 permanent failure
requires 2 members for quorum

After losing 2 members:
no quorum
normal writes stop
restore from snapshot into a new logical cluster
verify revision/compaction expectations for clients

Backup frequency is part of the availability story. Without snapshots, "consensus protects us" is incomplete.

Tradeoff Table

ChoiceGainCost
3 memberssimple and commontolerates only one permanent failure
5 memberstolerates two permanent failuresmore nodes and slower quorum path
snapshot restorerecovers from quorum losspossible data loss since snapshot
unsafe single-node continuationapparent availabilityviolates consensus authority assumptions

Failure Mode

The operator turns disaster recovery into a split-brain event by forcing a lone member to act as the old cluster.

Required Artifact

Write a quorum/recovery worksheet:

Cluster size:
Quorum size:
Permanent failures tolerated:
Snapshot interval:
Maximum data loss:
Restore procedure:
Client reconfiguration:
Validation checks:

Project / Capstone Connection

If a capstone relies on etcd, ZooKeeper, Consul, or a Raft service, include quorum math and backup/restore evidence.


Case Study 4: Retry Storm After A Partial Outage

Scenario: A payment API slows down because a downstream risk service is overloaded. Clients time out at 500 ms and retry immediately. The retry traffic doubles load, causes more timeouts, and spreads the incident to healthy services.

Source anchor: Amazon Builders' Library explains that retries are selfish, can amplify overload, and need timeouts, capped retries, exponential backoff, jitter, and idempotency for side-effecting APIs. See Timeouts, retries, and backoff with jitter and Making retries safe with idempotent APIs.

Firecrawl result used: both AWS Builders' Library articles for targeted one-result searches on retries/backoff and idempotent APIs.

Module concepts:

  • partial failure
  • timeout
  • retry amplification
  • exponential backoff
  • jitter
  • idempotency
  • overload control

Wrong Approach

"If a call times out, retry immediately until it succeeds."

Timeout does not prove the operation failed. It proves the caller stopped waiting. Immediate retry can duplicate side effects and increase load on the already-slow dependency.

Better Approach

Design retry as a load-control mechanism:

Timeout:
based on downstream latency distribution and caller budget

Retry:
bounded attempts
exponential backoff
jitter
idempotency key for side effects

Failure:
degrade or fail closed after budget is exhausted

Retry policy belongs in the API contract and client library, not as scattered loops around HTTP calls.

Tradeoff Table

ChoiceGainCost
no retriesavoids amplificationtransient faults become user failures
immediate retriesquick recovery for rare dropsoverload and duplicate effects
backoff with jittersmoother load during incidentshigher tail latency
idempotency keyssafe side-effect retryrequires storage and request identity rules

Failure Mode

The retry policy turns a small downstream slowdown into a system-wide outage.

Required Artifact

Write a retry budget:

Operation:
Side effect?:
Timeout:
Max attempts:
Backoff:
Jitter:
Idempotency key:
Circuit/degrade rule:
Metrics:

Project / Capstone Connection

Every capstone API that calls another service should include a timeout/retry/idempotency contract.


Case Study 5: Wall-Clock Timestamp Used As Causal Truth

Scenario: A collaborative editor accepts updates from multiple regions. The conflict resolver picks the update with the largest updated_at timestamp. One region's clock jumps forward after time synchronization. Its stale edit wins over a causally later edit from another region.

Source anchor: Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" defines happens-before and logical clocks for reasoning about event order without relying on synchronized physical clocks. AWS EC2 documentation describes time synchronization services, but clock synchronization still provides a time reference, not a causal-ordering proof. See Lamport time/clocks paper and AWS EC2 time synchronization.

Firecrawl result used: Lamport's paper and AWS EC2 time-sync docs from targeted searches.

Module concepts:

  • wall-clock time
  • clock skew
  • happens-before
  • Lamport clocks
  • vector clocks
  • concurrent updates

Wrong Approach

"NTP keeps clocks synced, so timestamps tell us which edit happened last."

Physical clocks can move, skew, and disagree. Even accurate clocks do not encode message causality unless the system explicitly models it.

Better Approach

Choose the right ordering tool:

Need total display order:
Lamport clock + tie-breaker may be enough

Need detect concurrent edits:
vector clock or version vector

Need real-time external order:
use a database/system that explicitly provides it

Need merge user edits:
CRDT/OT or domain-specific merge rule

Timestamp ordering can be acceptable for approximate UI sorting. It is unsafe as a correctness mechanism for causality.

Tradeoff Table

ChoiceGainCost
wall-clock last-write-winssimpleloses causality and can discard valid edits
Lamport clockcaptures happens-before ordercannot detect all concurrency by itself
vector clockdetects concurrent historiesmetadata grows with participants
CRDT/OT mergepreserves concurrent workmore complex data model

Failure Mode

The system silently overwrites a user's newer work because "newer timestamp" was confused with "causally newer."

Required Artifact

Draw a three-node event trace:

Node A events:
Node B events:
Messages:
Wall-clock timestamps:
Lamport timestamps:
Vector timestamps:
Concurrent pairs:
Conflict rule:

Project / Capstone Connection

Any capstone with offline edits, multi-region writes, chat, document collaboration, or event ordering needs a real ordering model.


Case Study 6: CAP During A Real Partition, Not As A Slogan

Scenario: A profile service is replicated across two regions. During a network partition, both regions can still serve users. Product asks engineering to keep profile updates available everywhere and still guarantee that every read sees the latest write.

Source anchor: Gilbert and Lynch's CAP proof formalizes the tension between consistency, availability, and partition tolerance in an asynchronous network model. The practical lesson is not "pick two forever"; it is that under a partition, a system that must preserve strong consistency may have to reject or delay some operations. See Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.

Firecrawl result used: the Gilbert-Lynch CAP paper for CAP theorem Brewer Gilbert Lynch paper PDF consistency availability partition tolerance.

Module concepts:

  • CAP
  • partition tolerance
  • availability
  • linearizability
  • degraded mode
  • consistency contract

Wrong Approach

"CAP says we picked AP, so stale or conflicting updates are fine everywhere."

That is too blunt. The correct design can be per operation. Some writes may be unavailable under partition while other reads or low-risk updates continue.

Better Approach

Classify operations:

Strongly consistent:
password changes
permission changes
billing profile

Available/degraded:
display name
avatar
preferences

Conflict-resolved:
bio text
notification settings

For each operation, define partition behavior and user-facing state. Do not use "eventual consistency" as a blanket excuse for unsafe writes.

Tradeoff Table

ChoiceGainCost
reject sensitive writes during partitionpreserves strong invariantreduced availability
accept all writes in both regionshigh local availabilityconflict resolution and stale reads
per-operation consistency policymatches business riskmore design and implementation work
read-only degraded modesimple safety postureusers cannot complete some workflows

Failure Mode

The service stays "available" by accepting conflicting permission or billing updates that require manual cleanup.

Required Artifact

Write a consistency matrix:

Operation:
Partition behavior:
Read freshness:
Conflict policy:
User-facing message:
Reconciliation path:
Metric/alert:

Project / Capstone Connection

If a capstone mentions CAP, include an operation-level matrix instead of a one-line CP/AP label.


Case Study 7: Stale Read In A Consensus-Backed Store

Scenario: A team uses etcd because it is "Raft-backed." They assume every read is automatically linearizable. A client performs a read from a member that is partitioned from the leader or is serving through a path that does not verify current leadership, and receives stale data.

Source anchor: Jepsen's etcd 3.4.3 analysis describes etcd as a Raft-based distributed database and examines stale-read behavior and consistency under faults. Use it as a reminder that real client APIs and read modes matter even when the core replication algorithm is strong. See Jepsen: etcd 3.4.3.

Firecrawl result used: the Jepsen etcd analysis for site:jepsen.io analyses etcd 3.4 raft consensus partitions stale reads.

Module concepts:

  • linearizable read
  • stale read
  • Raft
  • leader lease
  • client API contract
  • verification under fault

Wrong Approach

"Raft system means every operation is automatically linearizable."

Consensus is a mechanism inside the system. The client-visible guarantee depends on the API path, read mode, leader checks, leases, and implementation.

Better Approach

Read the client contract and test the fault mode:

For every read path:
is it served by leader?
does it go through quorum/ReadIndex?
does it rely on a lease?
what happens during partition or clock pause?

Then write tests that inject partitions and pauses while checking the exact history property the application needs.

Tradeoff Table

ChoiceGainCost
local/stale readlow latencymay violate freshness assumptions
leader linearizable readstrong semanticsmore coordination and latency
quorum read pathavoids stale followershigher read cost
Jepsen-style fault testingvalidates actual behaviortest complexity and time

Failure Mode

The architecture is correct on paper, but the application uses a weaker API path than the design assumes.

Required Artifact

Write a read-contract test plan:

Read API:
Expected consistency:
Leader/follower behavior:
Partition injected:
Pause injected:
History checker:
Acceptable stale window:
Failure evidence:

Project / Capstone Connection

If your capstone uses a consensus-backed store for locks, config, or leader election, test the exact read/write API semantics.


Source Map

SourceUse it for
GitHub October 21 post-incident analysisnetwork partition, failover, database recovery, operational tradeoffs
Raft paperleader election, replicated logs, quorum commit, uncommitted entries
etcd disaster recoveryquorum failure tolerance and snapshot-based recovery
AWS timeouts, retries, and backoff with jitterretry amplification, backoff, jitter, timeout design
AWS idempotent APIssafe retries for side-effecting APIs
Lamport time/clocks paperhappens-before, logical clocks, event ordering
AWS EC2 time synchronizationphysical time synchronization as a time reference, not causal proof
Gilbert-Lynch CAP paperconsistency, availability, partition tolerance tradeoff under partition
Jepsen: etcd 3.4.3consensus-backed store behavior under faults and stale-read analysis

Completion Standard

  • At least three case-study artifacts are completed.
  • At least one artifact includes a partition or quorum diagram.
  • At least one artifact includes a timeout/retry/idempotency budget.
  • At least one artifact includes a logical-clock or vector-clock trace.
  • At least one artifact includes an operation-level CAP/consistency matrix.
  • At least one artifact tests the actual API semantics of a consensus-backed service.
  • At least one case connects back to Module 3 replication/partitioning and Module 4 transactions/consistency.