Failure Model Workshop

Retrieval Prompts

  1. State the definitions of crash-stop, crash-recovery, omission, and Byzantine failure.
  2. State the quorum requirement for crash-fault-tolerant consensus and for Byzantine-fault-tolerant consensus.
  3. State what "completeness" and "accuracy" mean for a failure detector.
  4. State the eight fallacies of distributed computing from memory.
  5. State why, in an asynchronous system, no timeout can reliably distinguish a slow process from a crashed one.
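Once you have answered prompt 2 from memory, the arithmetic can be checked with a short sketch. It assumes the standard results of 2f+1 replicas with majority quorums for crash faults and 3f+1 replicas with quorums of 2f+1 for Byzantine faults. The function name `quorum_requirements` is hypothetical and chosen only for this illustration.

```python
def quorum_requirements(f: int, byzantine: bool = False) -> tuple[int, int]:
    """Return (minimum cluster size, quorum size) needed to tolerate f faulty nodes.

    Assumes the textbook bounds: n = 2f + 1 with quorum f + 1 for crash faults,
    and n = 3f + 1 with quorum 2f + 1 for Byzantine faults.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if byzantine:
        # Any two quorums of size 2f + 1 out of 3f + 1 overlap in at least
        # f + 1 nodes, so at least one node in the overlap is correct.
        return 3 * f + 1, 2 * f + 1
    # Any two majorities of 2f + 1 nodes overlap in at least one node.
    return 2 * f + 1, f + 1


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, quorum_requirements(f), quorum_requirements(f, byzantine=True))
```

Tolerating one more fault costs two extra replicas under the crash model and three under the Byzantine model, which is part of why Byzantine tolerance is rarely chosen by default.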

Compare and Distinguish

Separate these pairs cleanly:

  • crash-stop vs crash-recovery
  • omission vs Byzantine
  • completeness vs accuracy
  • suspected vs confirmed dead
  • fixed timeout vs phi-accrual threshold
  • gossip vs SWIM
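To make the "fixed timeout vs phi-accrual threshold" pair concrete, here is a minimal sketch of both detectors. It is not a production implementation. The class names are hypothetical, and the phi calculation assumes an exponential model of heartbeat inter-arrival times, chosen here for simplicity; the original phi-accrual formulation fits a normal distribution instead.

```python
import math
from collections import deque


class FixedTimeoutDetector:
    """Suspects a peer once the gap since its last heartbeat exceeds a constant."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last = None

    def heartbeat(self, now: float) -> None:
        self.last = now

    def suspects(self, now: float) -> bool:
        return self.last is not None and now - self.last > self.timeout_s


class PhiAccrualDetector:
    """Outputs a suspicion level (phi) that adapts to observed heartbeat gaps."""

    def __init__(self, threshold: float, window: int = 1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last = None

    def heartbeat(self, now: float) -> None:
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0  # no history yet, so no basis for suspicion
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # Assumed exponential model: P(gap > elapsed) = exp(-elapsed / mean),
        # and phi = -log10 of that probability.
        return elapsed / (mean * math.log(10))

    def suspects(self, now: float) -> bool:
        return self.phi(now) > self.threshold


if __name__ == "__main__":
    d = PhiAccrualDetector(threshold=8.0)
    for t in (0.0, 1.0, 2.0, 3.0):
        d.heartbeat(t)
    print(d.phi(4.0), d.suspects(4.0))
```

The fixed detector gives a yes/no answer against a constant, while the accrual detector reports a continuous value scaled by the peer's own recent behavior, leaving the threshold decision to the caller.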

Common Mistake Check

For each statement, identify the error:

  1. "Since we run a modern cloud, we don't have partial failures anymore."
  2. "TCP told us the connection is broken, so the peer process is down."
  3. "We need Byzantine fault tolerance because our nodes sometimes return weird data."
  4. "Our heartbeat is every 10ms with a 30ms timeout, so we detect failures fast."
  5. "A failure detector that always says 'all alive' is at least safe."

Fallacy Audit

Pick a production system you have used or built. For each of the eight fallacies, write one sentence: either "handled by X" or "implicitly assumed (latent bug)." If you can name a handling mechanism for fewer than five of the eight, the system carries substantial latent risk; record each unhandled fallacy as a to-do item.

Scenario Classification

For each incident description, classify the failure model that best fits, and explain which detector and which mitigation are appropriate:

  1. Incident A: a node's hard drive is failing silently, returning stale data on reads for half an hour before it reports any I/O errors.
  2. Incident B: a GC pause of 4 seconds on the Raft leader causes a follower to start an election.
  3. Incident C: a misconfigured route on a router drops 20% of packets between two zones, while traffic between other zones is unaffected.
  4. Incident D: a rogue operator manually removes a replica from rotation while it still holds the leader lease.
  5. Incident E: a bug in a client library sends malformed messages that cause one of three replicas to compute a different result.

For each, name the most restrictive failure model that still captures the behavior, and the next more restrictive model that fails to capture it.

Phi-Accrual Tuning Drill

You run a Cassandra-like system across three AWS availability zones with a typical steady-state inter-node RTT of 1-3ms, a 99th-percentile RTT of 30ms, and JVM GC pauses of up to 500ms. The application tolerates up to 10 false-positive suspicions per node per day.

  1. Estimate how many heartbeat inter-arrival samples the detector's sliding window needs to track.
  2. Propose a phi threshold that yields roughly 10 false positives per node per day, given one heartbeat per second.
  3. Compare to a fixed-timeout design. What timeout value would you pick, and what detection delay would you accept?
  4. Describe what changes if the cluster spans regions with a 50ms base latency.
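As a starting point for question 2, the sketch below turns a false-positive budget into a phi threshold. It relies on the definition phi = -log10(P), where P is the modeled probability that a live node's next heartbeat arrives later than the gap observed so far, so a threshold of Φ corresponds to a false-suspicion probability of about 10^-Φ per heartbeat. The function names are hypothetical. The delay estimate assumes an exponential inter-arrival model, which does not account for the 500ms GC pauses in the scenario, so treat the result as a first guess to refine rather than a final answer.

```python
import math

SECONDS_PER_DAY = 86_400


def phi_threshold(false_positives_per_day: float, heartbeat_interval_s: float) -> float:
    """Phi threshold whose per-heartbeat false-suspicion probability meets the budget."""
    if false_positives_per_day <= 0 or heartbeat_interval_s <= 0:
        raise ValueError("inputs must be positive")
    heartbeats_per_day = SECONDS_PER_DAY / heartbeat_interval_s
    p_false = false_positives_per_day / heartbeats_per_day
    return -math.log10(p_false)


def detection_delay_s(threshold: float, mean_interval_s: float) -> float:
    """Silence needed to reach the threshold, under an ASSUMED exponential model.

    With P(gap > t) = exp(-t / mean), phi(t) = t / (mean * ln 10),
    so the threshold is crossed at t = threshold * mean * ln 10.
    """
    return threshold * mean_interval_s * math.log(10)


if __name__ == "__main__":
    thr = phi_threshold(false_positives_per_day=10, heartbeat_interval_s=1.0)
    print(f"phi threshold ~ {thr:.2f}")
    print(f"detection delay ~ {detection_delay_s(thr, 1.0):.1f} s (exponential model)")
```

For the drill's numbers (10 false positives per day, one heartbeat per second) this gives a threshold near 3.94 and, under the exponential assumption, roughly 9 seconds of silence before suspicion. Comparing that delay against the fixed timeout you would pick in question 3 is the point of the exercise.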

Evidence Check

This practice page is complete only if you can:

  • Classify any real incident into crash-stop, crash-recovery, omission, or Byzantine within a minute of hearing it.
  • Justify the failure model assumed by a specific algorithm (Raft, Paxos, PBFT) and what breaks if the model is violated.
  • Tune a phi-accrual detector for a described workload and defend the choice.
  • Name the failure model that quietly underpins a production system you have operated.