Module 5: Distributed Systems Fundamentals: Case Studies

These case studies are about the part of engineering that unit tests rarely teach: messages arrive late, clocks disagree, leaders pause, retries multiply load, and partitions make "just keep serving" an unsafe requirement. Work through each case by drawing the failure model first, then the mechanism.

How To Use These Case Studies

State the failure model before proposing a fix.
Draw the timeline, partition, quorum, or message graph.
Name the mechanism: timeout, logical clock, vector clock, quorum, consensus, fencing token, idempotency key, or backoff.
Identify what is guaranteed and what is only best effort.
Produce the required artifact and connect it to a capstone operation.

Case Study 1: GitHub's Network Partition And The Cost Of Split Databases

Scenario: A data-center network event separates primary and secondary database sites. Writes continue in a degraded topology. When connectivity returns, the system must reconcile database state, delay normal service, and protect data correctness before fully recovering.

Source anchor: GitHub's October 21, 2018 post-incident analysis describes a network partition between US East Coast facilities, database failover complications, and a long recovery process to preserve data integrity. See GitHub October 21 post-incident analysis.

Firecrawl result used: the GitHub postmortem as the top result for GitHub October 21 2018 network partition postmortem.

Module concepts:

partial failure
network partition
split-brain risk
failover
recovery over raw availability
operational runbook

Wrong Approach

"If a region is unreachable, promote the other side and keep serving all writes."

That can preserve short-term availability while creating divergent histories that are expensive or impossible to merge safely.

Better Approach

Write the failover policy around data safety:

During partition:
  identify which side has write authority
  fence old leaders
  restrict unsafe writes if authority is ambiguous

After partition heals:
  stop automatic churn
  compare replication positions
  decide recovery order
  reconcile or rebuild replicas
  communicate degraded modes

The mature move is not "always stay writable." It is knowing which operations must stop when the system cannot prove a single authority.

Tradeoff Table

Choice	Gain	Cost
keep both sides writable	high immediate availability	divergent state and reconciliation risk
single write authority	preserves a clearer history	some users lose write availability
read-only degraded mode	protects data while serving partial value	product capability is reduced
manual recovery gates	avoids automated damage	slower recovery and operational load

Failure Mode

The system recovers network connectivity but not logical consistency. The outage continues because data histories diverged.

Required Artifact

Write a partition runbook:

Authority source:
Fence mechanism:
Writes allowed:
Reads allowed:
Replication-position check:
Recovery order:
User-visible mode:
Abort condition:

Project / Capstone Connection

Any capstone with replicas, regions, or failover should include a "who may write during partition?" answer.

Case Study 2: Raft Majority Prevents Two Leaders From Committing

Scenario: A five-node coordination cluster partitions into {N1, N2} and {N3, N4, N5}. 1 was leader before the partition. Clients can still reach 1, so the team expects 1` to continue accepting writes.

Source anchor: The Raft paper explains leader election, terms, replicated logs, and the majority rule that allows a leader to commit log entries only after replication to a majority. See In Search of an Understandable Consensus Algorithm.

Firecrawl result used: the Raft paper from raft.github.io for site:raft.github.io raft consensus paper.

Module concepts:

consensus
leader election
quorum
replicated log
partition tolerance
committed vs uncommitted entries

Wrong Approach

"A node that still believes it is leader can keep committing writes for its clients."

Belief is not enough. Commitment requires a majority. The minority-side leader may append local entries, but it cannot safely commit them.

Better Approach

Reason from quorum intersection:

Cluster size: 5
Majority: 3
Partition A: N1, N2
Partition B: N3, N4, N5

N1 side:
  can receive client writes
  cannot replicate to majority
  cannot commit new entries

N3/N4/N5 side:
  can elect a leader
  can commit entries with majority

When the partition heals, uncommitted minority entries can be overwritten by the new leader's log.

Tradeoff Table

Choice	Gain	Cost
require majority commit	prevents split-brain commits	minority side loses write availability
allow minority writes	local availability	conflicting committed histories
increase cluster size	tolerate more failures	more coordination and operational cost
colocate quorum carefully	better availability model	placement complexity

Failure Mode

The minority leader returns success for writes that later disappear, because they were never committed by a quorum.

Required Artifact

Draw a Raft partition trace:

Initial leader:
Partition sets:
Majority size:
Can old leader commit?:
Can new leader be elected?:
Which entries survive after heal?:
Client response rule:

Project / Capstone Connection

If a capstone uses a coordination system, distinguish "accepted by a node" from "committed by quorum."

Case Study 3: etcd Disaster Recovery And Quorum Math

Scenario: A Kubernetes control-plane dependency runs on a three-member etcd cluster. Two members are lost after disk corruption. An operator expects the remaining member to keep the cluster alive because one copy of the data still exists.

Source anchor: etcd disaster recovery docs state that a cluster tolerates up to (N-1)/2 permanent failures for ` members, and that recovery from a snapshot creates a new logical cluster. See etcd disaster recovery.

Firecrawl result used: the official etcd disaster recovery page for site:etcd.io docs raft leader election quorum official docs etcd disaster recovery.

Module concepts:

quorum
failure tolerance
consensus availability
snapshot recovery
cluster identity
disaster recovery

Wrong Approach

"One remaining node means the data still exists, so the cluster can keep operating."

Consensus systems need a quorum, not merely any surviving copy. A single node from a three-node cluster cannot prove it is the current authority after losing the other two.

Better Approach

Plan recovery around quorum and snapshots:

Normal 3-member cluster:
  tolerates 1 permanent failure
  requires 2 members for quorum

After losing 2 members:
  no quorum
  normal writes stop
  restore from snapshot into a new logical cluster
  verify revision/compaction expectations for clients

Backup frequency is part of the availability story. Without snapshots, "consensus protects us" is incomplete.

Tradeoff Table

Choice	Gain	Cost
3 members	simple and common	tolerates only one permanent failure
5 members	tolerates two permanent failures	more nodes and slower quorum path
snapshot restore	recovers from quorum loss	possible data loss since snapshot
unsafe single-node continuation	apparent availability	violates consensus authority assumptions

Failure Mode

The operator turns disaster recovery into a split-brain event by forcing a lone member to act as the old cluster.

Required Artifact

Write a quorum/recovery worksheet:

Cluster size:
Quorum size:
Permanent failures tolerated:
Snapshot interval:
Maximum data loss:
Restore procedure:
Client reconfiguration:
Validation checks:

Project / Capstone Connection

If a capstone relies on etcd, ZooKeeper, Consul, or a Raft service, include quorum math and backup/restore evidence.

Case Study 4: Retry Storm After A Partial Outage

Scenario: A payment API slows down because a downstream risk service is overloaded. Clients time out at 500 ms and retry immediately. The retry traffic doubles load, causes more timeouts, and spreads the incident to healthy services.

Source anchor: Amazon Builders' Library explains that retries are selfish, can amplify overload, and need timeouts, capped retries, exponential backoff, jitter, and idempotency for side-effecting APIs. See Timeouts, retries, and backoff with jitter and Making retries safe with idempotent APIs.

Firecrawl result used: both AWS Builders' Library articles for targeted one-result searches on retries/backoff and idempotent APIs.

Module concepts:

partial failure
timeout
retry amplification
exponential backoff
jitter
idempotency
overload control

Wrong Approach

"If a call times out, retry immediately until it succeeds."

Timeout does not prove the operation failed. It proves the caller stopped waiting. Immediate retry can duplicate side effects and increase load on the already-slow dependency.

Better Approach

Design retry as a load-control mechanism:

Timeout:
  based on downstream latency distribution and caller budget

Retry:
  bounded attempts
  exponential backoff
  jitter
  idempotency key for side effects

Failure:
  degrade or fail closed after budget is exhausted

Retry policy belongs in the API contract and client library, not as scattered loops around HTTP calls.

Tradeoff Table

Choice	Gain	Cost
no retries	avoids amplification	transient faults become user failures
immediate retries	quick recovery for rare drops	overload and duplicate effects
backoff with jitter	smoother load during incidents	higher tail latency
idempotency keys	safe side-effect retry	requires storage and request identity rules

Failure Mode

The retry policy turns a small downstream slowdown into a system-wide outage.

Required Artifact

Write a retry budget:

Operation:
Side effect?:
Timeout:
Max attempts:
Backoff:
Jitter:
Idempotency key:
Circuit/degrade rule:
Metrics:

Project / Capstone Connection

Every capstone API that calls another service should include a timeout/retry/idempotency contract.

Case Study 5: Wall-Clock Timestamp Used As Causal Truth

Scenario: A collaborative editor accepts updates from multiple regions. The conflict resolver picks the update with the largest updated_at timestamp. One region's clock jumps forward after time synchronization. Its stale edit wins over a causally later edit from another region.

Source anchor: Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" defines happens-before and logical clocks for reasoning about event order without relying on synchronized physical clocks. AWS EC2 documentation describes time synchronization services, but clock synchronization still provides a time reference, not a causal-ordering proof. See Lamport time/clocks paper and AWS EC2 time synchronization.

Firecrawl result used: Lamport's paper and AWS EC2 time-sync docs from targeted searches.

Module concepts:

wall-clock time
clock skew
happens-before
Lamport clocks
vector clocks
concurrent updates

Wrong Approach

"NTP keeps clocks synced, so timestamps tell us which edit happened last."

Physical clocks can move, skew, and disagree. Even accurate clocks do not encode message causality unless the system explicitly models it.

Better Approach

Choose the right ordering tool:

Need total display order:
  Lamport clock + tie-breaker may be enough

Need detect concurrent edits:
  vector clock or version vector

Need real-time external order:
  use a database/system that explicitly provides it

Need merge user edits:
  CRDT/OT or domain-specific merge rule

Timestamp ordering can be acceptable for approximate UI sorting. It is unsafe as a correctness mechanism for causality.

Tradeoff Table

Choice	Gain	Cost
wall-clock last-write-wins	simple	loses causality and can discard valid edits
Lamport clock	captures happens-before order	cannot detect all concurrency by itself
vector clock	detects concurrent histories	metadata grows with participants
CRDT/OT merge	preserves concurrent work	more complex data model

Failure Mode

The system silently overwrites a user's newer work because "newer timestamp" was confused with "causally newer."

Required Artifact

Draw a three-node event trace:

Node A events:
Node B events:
Messages:
Wall-clock timestamps:
Lamport timestamps:
Vector timestamps:
Concurrent pairs:
Conflict rule:

Project / Capstone Connection

Any capstone with offline edits, multi-region writes, chat, document collaboration, or event ordering needs a real ordering model.

Case Study 6: CAP During A Real Partition, Not As A Slogan

Scenario: A profile service is replicated across two regions. During a network partition, both regions can still serve users. Product asks engineering to keep profile updates available everywhere and still guarantee that every read sees the latest write.

Source anchor: Gilbert and Lynch's CAP proof formalizes the tension between consistency, availability, and partition tolerance in an asynchronous network model. The practical lesson is not "pick two forever"; it is that under a partition, a system that must preserve strong consistency may have to reject or delay some operations. See Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.

Firecrawl result used: the Gilbert-Lynch CAP paper for CAP theorem Brewer Gilbert Lynch paper PDF consistency availability partition tolerance.

Module concepts:

CAP
partition tolerance
availability
linearizability
degraded mode
consistency contract

Wrong Approach

"CAP says we picked AP, so stale or conflicting updates are fine everywhere."

That is too blunt. The correct design can be per operation. Some writes may be unavailable under partition while other reads or low-risk updates continue.

Better Approach

Classify operations:

Strongly consistent:
  password changes
  permission changes
  billing profile

Available/degraded:
  display name
  avatar
  preferences

Conflict-resolved:
  bio text
  notification settings

For each operation, define partition behavior and user-facing state. Do not use "eventual consistency" as a blanket excuse for unsafe writes.

Tradeoff Table

Choice	Gain	Cost
reject sensitive writes during partition	preserves strong invariant	reduced availability
accept all writes in both regions	high local availability	conflict resolution and stale reads
per-operation consistency policy	matches business risk	more design and implementation work
read-only degraded mode	simple safety posture	users cannot complete some workflows

Failure Mode

The service stays "available" by accepting conflicting permission or billing updates that require manual cleanup.

Required Artifact

Write a consistency matrix:

Operation:
Partition behavior:
Read freshness:
Conflict policy:
User-facing message:
Reconciliation path:
Metric/alert:

Project / Capstone Connection

If a capstone mentions CAP, include an operation-level matrix instead of a one-line CP/AP label.

Case Study 7: Stale Read In A Consensus-Backed Store

Scenario: A team uses etcd because it is "Raft-backed." They assume every read is automatically linearizable. A client performs a read from a member that is partitioned from the leader or is serving through a path that does not verify current leadership, and receives stale data.

Source anchor: Jepsen's etcd 3.4.3 analysis describes etcd as a Raft-based distributed database and examines stale-read behavior and consistency under faults. Use it as a reminder that real client APIs and read modes matter even when the core replication algorithm is strong. See Jepsen: etcd 3.4.3.

Firecrawl result used: the Jepsen etcd analysis for site:jepsen.io analyses etcd 3.4 raft consensus partitions stale reads.

Module concepts:

linearizable read
stale read
Raft
leader lease
client API contract
verification under fault

Wrong Approach

"Raft system means every operation is automatically linearizable."

Consensus is a mechanism inside the system. The client-visible guarantee depends on the API path, read mode, leader checks, leases, and implementation.

Better Approach

Read the client contract and test the fault mode:

For every read path:
  is it served by leader?
  does it go through quorum/ReadIndex?
  does it rely on a lease?
  what happens during partition or clock pause?

Then write tests that inject partitions and pauses while checking the exact history property the application needs.

Tradeoff Table

Choice	Gain	Cost
local/stale read	low latency	may violate freshness assumptions
leader linearizable read	strong semantics	more coordination and latency
quorum read path	avoids stale followers	higher read cost
Jepsen-style fault testing	validates actual behavior	test complexity and time

Failure Mode

The architecture is correct on paper, but the application uses a weaker API path than the design assumes.

Required Artifact

Write a read-contract test plan:

Read API:
Expected consistency:
Leader/follower behavior:
Partition injected:
Pause injected:
History checker:
Acceptable stale window:
Failure evidence:

Project / Capstone Connection

If your capstone uses a consensus-backed store for locks, config, or leader election, test the exact read/write API semantics.

Source Map

Source	Use it for
GitHub October 21 post-incident analysis	network partition, failover, database recovery, operational tradeoffs
Raft paper	leader election, replicated logs, quorum commit, uncommitted entries
etcd disaster recovery	quorum failure tolerance and snapshot-based recovery
AWS timeouts, retries, and backoff with jitter	retry amplification, backoff, jitter, timeout design
AWS idempotent APIs	safe retries for side-effecting APIs
Lamport time/clocks paper	happens-before, logical clocks, event ordering
AWS EC2 time synchronization	physical time synchronization as a time reference, not causal proof
Gilbert-Lynch CAP paper	consistency, availability, partition tolerance tradeoff under partition
Jepsen: etcd 3.4.3	consensus-backed store behavior under faults and stale-read analysis

Completion Standard

At least three case-study artifacts are completed.
At least one artifact includes a partition or quorum diagram.
At least one artifact includes a timeout/retry/idempotency budget.
At least one artifact includes a logical-clock or vector-clock trace.
At least one artifact includes an operation-level CAP/consistency matrix.
At least one artifact tests the actual API semantics of a consensus-backed service.
At least one case connects back to Module 3 replication/partitioning and Module 4 transactions/consistency.

How To Use These Case Studies​

Case Study 1: GitHub's Network Partition And The Cost Of Split Databases​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 2: Raft Majority Prevents Two Leaders From Committing​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 3: etcd Disaster Recovery And Quorum Math​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 4: Retry Storm After A Partial Outage​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 5: Wall-Clock Timestamp Used As Causal Truth​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 6: CAP During A Real Partition, Not As A Slogan​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Case Study 7: Stale Read In A Consensus-Backed Store​

Wrong Approach​

Better Approach​

Tradeoff Table​

Failure Mode​

Required Artifact​

Project / Capstone Connection​

Source Map​

Completion Standard​

How To Use These Case Studies

Case Study 1: GitHub's Network Partition And The Cost Of Split Databases

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 2: Raft Majority Prevents Two Leaders From Committing

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 3: etcd Disaster Recovery And Quorum Math

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 4: Retry Storm After A Partial Outage

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 5: Wall-Clock Timestamp Used As Causal Truth

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 6: CAP During A Real Partition, Not As A Slogan

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Case Study 7: Stale Read In A Consensus-Backed Store

Wrong Approach

Better Approach

Tradeoff Table

Failure Mode

Required Artifact

Project / Capstone Connection

Source Map

Completion Standard