Failure Model Workshop
Retrieval Prompts
- State the definitions of crash-stop, crash-recovery, omission, and Byzantine failure.
- State the quorum requirement for crash-fault-tolerant consensus and for Byzantine-fault-tolerant consensus.
- State what "completeness" and "accuracy" mean for a failure detector.
- State the eight fallacies of distributed computing from memory.
- State why, in an asynchronous system, no timeout can reliably distinguish a slow process from a crashed one.
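To check your answer to the quorum prompt, the arithmetic can be sketched in Python. This is a minimal sketch under the standard assumptions (n >= 2f + 1 replicas for crash faults, n >= 3f + 1 for Byzantine faults). The function names max_faults and quorum are illustrative and do not come from any library.

```python
def max_faults(n: int, byzantine: bool = False) -> int:
    """Largest number of faulty replicas f that a cluster of n tolerates."""
    if n < 1:
        raise ValueError("cluster needs at least one replica")
    # Crash faults need n >= 2f + 1; Byzantine faults need n >= 3f + 1.
    return (n - 1) // 3 if byzantine else (n - 1) // 2


def quorum(n: int, byzantine: bool = False) -> int:
    """Smallest quorum size for n replicas at the maximum tolerated f."""
    if not byzantine:
        return n // 2 + 1  # any two majorities share at least one replica
    f = max_faults(n, byzantine=True)
    # Two quorums must overlap in at least f + 1 replicas so that at least
    # one replica in the overlap is correct: q = ceil((n + f + 1) / 2).
    return (n + f) // 2 + 1


if __name__ == "__main__":
    # Columns: n, crash f, crash quorum, Byzantine f, Byzantine quorum
    for n in (3, 4, 5, 7):
        print(n, max_faults(n), quorum(n),
              max_faults(n, byzantine=True), quorum(n, byzantine=True))
```

Try it on your own cluster sizes and confirm that adding one replica to an odd-sized crash-tolerant cluster raises the quorum size without raising f.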
Compare and Distinguish
Separate these pairs cleanly:
- crash-stop vs crash-recovery
- omission vs Byzantine
- completeness vs accuracy
- suspected vs confirmed dead
- fixed timeout vs phi-accrual threshold
- gossip-style heartbeating vs SWIM
Common Mistake Check
For each statement, identify the error:
- "Since we run a modern cloud, we don't have partial failures anymore."
- "TCP told us the connection is broken, so the peer process is down."
- "We need Byzantine fault tolerance because our nodes sometimes return weird data."
- "Our heartbeat is every 10ms with a 30ms timeout, so we detect failures fast."
- "A failure detector that always says 'all alive' is at least safe."
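For the heartbeat statement, it can help to count how often a fixed timeout would have wrongly suspected a peer that was alive the whole time. The sketch below is a hypothetical helper, not part of any real detector. It assumes you already have a list of heartbeat arrival times in milliseconds from a live peer.

```python
def false_suspicions(arrival_times_ms, timeout_ms):
    """Count gaps between consecutive heartbeats that exceed the timeout.

    Every heartbeat in the list came from a live peer, so each gap longer
    than the timeout is a false-positive suspicion.
    """
    gaps = (later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:]))
    return sum(1 for gap in gaps if gap > timeout_ms)


if __name__ == "__main__":
    # Heartbeats every 10 ms, with one 50 ms stall (for example a short GC pause).
    arrivals = [0, 10, 20, 30, 80, 90]
    print(false_suspicions(arrivals, timeout_ms=30))
    print(false_suspicions(arrivals, timeout_ms=100))
```

With the sample trace, the 30 ms timeout flags the stalled peer once and the 100 ms timeout never does. A short timeout buys fast detection at the cost of accuracy.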
Fallacy Audit
Pick a production system you have used or built. For each of the eight fallacies, write one sentence: either "handled by X" or "implicitly assumed (latent bug)." If you can name a handling mechanism for fewer than five of the eight, the system likely carries several latent bugs; record each unhandled fallacy as a to-do item.
Scenario Classification
For each incident description, identify the failure model that best fits, and explain which detector and which mitigation are appropriate:
- Incident A: a node's hard drive fails silently, returning stale data on reads for half an hour before it reports any I/O errors.
- Incident B: a GC pause of 4 seconds on the Raft leader causes a follower to start an election.
- Incident C: a misconfigured router route drops 20% of packets between two zones but leaves traffic between other zones unaffected.
- Incident D: a rogue operator manually removes a replica from rotation while it still holds the leader lease.
- Incident E: a bug in a client library sends malformed messages that cause one of three replicas to compute a different result.
For each, name the most restrictive failure model that still captures the behavior, and the next more restrictive model that fails to capture it.
Phi-Accrual Tuning Drill
You run a Cassandra-like system across three AWS availability zones with typical steady-state inter-node RTT of 1-3ms, a 99th-percentile RTT of 30ms, and GC pauses up to 500ms on the JVM. The application tolerates 10 false-positive suspicions per node per day.
- Estimate the window of inter-arrival times you need to track.
- Propose a phi threshold that gives roughly 10 false positives per node per day, given a heartbeat every 1s.
- Compare to a fixed-timeout design. What timeout value would you pick, and what detection delay would you accept?
- Describe what changes if the cluster is across regions with 50ms base latency.
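A back-of-envelope starting point for the threshold question can be sketched as follows. This sketch assumes the phi definition phi = -log10(P(next heartbeat arrives later than the elapsed time)), a normal model for inter-arrival times, and one suspicion opportunity per heartbeat. The function names and the min_std floor are illustrative choices, not the API of any particular system. Real traces with GC pauses have heavier tails than a normal model, so treat the result as a first guess to validate against measured data.

```python
import math
import statistics


def phi(elapsed_s, intervals_s, min_std_s=0.1):
    """Suspicion level after elapsed_s seconds without a heartbeat.

    intervals_s is the tracked window of past inter-arrival times.
    min_std_s is an assumed floor that keeps a very regular history
    from making the detector overly sensitive.
    """
    mean = statistics.fmean(intervals_s)
    std = max(statistics.pstdev(intervals_s), min_std_s)
    # Tail probability P(X > elapsed_s) under a normal model.
    p_later = 0.5 * math.erfc((elapsed_s - mean) / (std * math.sqrt(2)))
    p_later = max(p_later, 1e-300)  # avoid log10(0) for extreme gaps
    return -math.log10(p_later)


def threshold_for_budget(false_positives_per_day, heartbeat_interval_s):
    """Phi threshold whose modeled false-positive rate matches the budget."""
    if false_positives_per_day <= 0 or heartbeat_interval_s <= 0:
        raise ValueError("budget and interval must be positive")
    checks_per_day = 86_400 / heartbeat_interval_s
    return -math.log10(false_positives_per_day / checks_per_day)


if __name__ == "__main__":
    # 10 false positives per day at one heartbeat per second: about 3.94
    print(round(threshold_for_budget(10, 1.0), 2))
    window = [1.0, 1.02, 0.98, 1.01, 0.99]
    for elapsed in (1.0, 1.5, 2.0):
        print(elapsed, round(phi(elapsed, window), 2))
```

For the cross-region variant, rerun phi with a window whose mean and spread reflect the 50ms base latency and see how far the same threshold moves the detection delay.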
Evidence Check
This practice page is complete only if you can:
- Classify any real incident into crash-stop, crash-recovery, omission, or Byzantine within a minute of hearing it.
- Justify the failure model assumed by a specific algorithm (Raft, Paxos, PBFT) and what breaks if the model is violated.
- Tune a phi-accrual detector for a described workload and defend the choice.
- Name the failure model that quietly underpins a production system you have operated.