Reason About Failure: What Happens When X Dies?
What This Concept Is
Failure reasoning is the habit of walking your diagram and asking, for every box and arrow, "what happens when this dies?" The question is "when", not "if": every component will die eventually. What matters is what the user sees, whether the system recovers, and how long recovery takes.
For each box you kill, there are four things to describe:
- Immediate user impact: errors, timeouts, degraded responses, no impact.
- Blast radius: which other components are affected, and for how long.
- Recovery mechanism: automatic failover, health-check ejection, retry, circuit breaker, or manual runbook.
- Time to recover (TTR): seconds (in-process retry), minutes (health-check failover), hours (manual intervention).
For each arrow you interrupt, ask the same four questions, once for a dead arrow and once for a slow one. Slow arrows are often worse than dead ones: a dead call fails immediately, while a slow call holds its thread, socket, and memory until it times out.
A mature framing from the Principles of Chaos Engineering and the Amazon Builders' Library: failure is a control variable, not a surprise. You introduce specific failure modes (hard kill, slow response, partial outage, network partition, clock skew) and compare steady-state behavior before and after. The goal of the paper-walk here is to anticipate each of those experiments before they run -- to move failure modes from "unexpected bug" to "documented, bounded, survivable".
Why It Matters Here
"Highly available" is a claim; failure reasoning is the proof.
- A design with a hidden single point of failure (SPOF) can be 99% available and feel invincible, until the SPOF fails and the whole thing goes dark.
- A design with thoughtful failure modes degrades gracefully: read-only mode, stale cache, reduced features.
- Reviewers often test exactly this: "what happens if the primary DB goes down? what if only the replica fails? what if the network between them partitions?"
You cannot improvise this under pressure. You either walked through it already, or you are guessing.
Concrete Example: Social Feed Failure Walk
For the social feed design (abridged), kill each box in turn:
| Box killed | User impact | Blast radius | Recovery | TTR |
|---|---|---|---|---|
| CDN PoP | Slight latency rise for static assets in that region | Region only | Client DNS/anycast reroutes | Seconds |
| Global LB | Regional DNS failover required | All users in that region until DNS TTL expires | DNS/anycast reroute | 30 s - 5 min |
| Regional LB | 5-30 s blip while health checks shift traffic | That region only | Active-active with peer LBs | Seconds |
| Feed Service instance | Near zero; requests in flight on that instance may fail, new requests go to other instances | None (instance-level) | LB ejects on health | 10-30 s |
| Redis cache (single node in cluster) | Slight P99 rise; cold-cache reads hit DB | Temporarily higher DB load | Client-side consistent hashing rebalances | Minutes to warm |
| Redis cluster (entire cache tier) | Heavy DB load; origin may tip over | Whole region | Keep DB sized to survive; circuit-break non-critical | Until cache restored |
| DB primary shard | Writes to that shard fail | Users whose user_id hashes to that shard | Automatic failover to replica; promote | 30 s - 2 min |
| DB primary all shards | Writes globally fail | Everyone | Regional failover | 5-30 min |
| Kafka topic | Derived views (search, analytics) stop updating | Non-critical; user-visible degraded features | Kafka replica takes over; consumer catches up | Minutes |
| Fan-out worker | Timeline updates backlogged; new posts visible late | Non-critical | Queue processes on restart | Minutes |
| Follow-graph service | Fan-out stops; posts land in author-only view | Moderate; read-fanout can degrade | Service restart or replica | Minutes |
Notice what falls out of this exercise:
- The Redis cluster is effectively a soft SPOF: a total failure there causes cascading origin overload. Remediation: size the DB to survive cache-cold; add circuit breakers that shed non-critical reads.
- Fan-out being behind Kafka makes it naturally degradable: slowness is visible only as stale timelines, not errors.
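- DB primary failover determines your availability budget. One 30 s failover per month costs about 0.001% of a 30-day month (~99.999% available); one 2-minute failover costs about 0.005% (~99.995%). For scale, a 99.99% target allows roughly 4.3 minutes of downtime per month, so detection time and any failed failover come out of the same budget.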
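The availability arithmetic behind a failover budget can be sketched as follows. This is a minimal illustration, not part of the original design: it assumes a 30-day month and counts only failover downtime, and the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def availability(downtime_s_per_incident: float, incidents_per_month: float = 1.0) -> float:
    """Fraction of the month the system is up, counting only failover downtime."""
    downtime_s = downtime_s_per_incident * incidents_per_month
    return 1.0 - downtime_s / SECONDS_PER_MONTH


def downtime_budget_s(target: float) -> float:
    """Seconds of downtime per month that an availability target allows."""
    return (1.0 - target) * SECONDS_PER_MONTH


# Failover durations taken from the DB primary shard row of the table.
for failover_s in (30, 120):
    print(f"{failover_s:>4} s failover/month -> {availability(failover_s) * 100:.4f}% available")

# What common targets leave you to spend.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target * 100:.2f}% target -> {downtime_budget_s(target):.0f} s/month budget")
```

Running the numbers both ways (downtime to availability, and target to budget) makes it easier to see how much of the budget a single failover consumes.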
For an AZ outage, walk the same table at the AZ level: is every tier cross-AZ? Can it lose one AZ and still serve writes?
For a region outage, walk again: are you active-active, active-passive, or cold-DR?
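The AZ-level walk can be partly mechanized. The sketch below is an illustration under assumed data: the tier names and AZ placements are hypothetical, and the check only answers "does any instance survive?", not "is there enough capacity?" or "can it still take writes?".

```python
from typing import Dict, List, Set


def tiers_lost_without(placement: Dict[str, Set[str]], failed_az: str) -> List[str]:
    """Tiers that have no instance left once `failed_az` is gone."""
    return sorted(tier for tier, azs in placement.items() if not (azs - {failed_az}))


# Hypothetical placement: which AZs host at least one instance of each tier.
placement = {
    "feed-service": {"az-a", "az-b", "az-c"},
    "redis": {"az-a", "az-b"},
    "db-primary": {"az-a"},  # assumed single-AZ, to show what a finding looks like
}

all_azs = sorted(set().union(*placement.values()))
for az in all_azs:
    lost = tiers_lost_without(placement, az)
    verdict = "every tier survives" if not lost else "loses " + ", ".join(lost)
    print(f"lose {az}: {verdict}")
```

Any tier the check reports as lost is a row to add to the failure table at AZ granularity.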
Concrete Example 2: Slow Arrow vs Dead Arrow
Do not kill the Redis cluster this time; make it respond in 5 s instead of 2 ms:
- Every Feed Service request blocks on the cache lookup. Threads back up. The connection pool exhausts.
- Health checks on the Feed Service begin failing because new connections time out waiting for a thread.
- The LB ejects Feed Service instances as unhealthy. The remaining instances receive more traffic and exhaust their pools faster. This is a cascade.
- Net result: from the user's perspective, one slow component has taken the entire hot path offline, even though no component has technically died.
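The thread exhaustion in this cascade follows from Little's law: average requests in flight equal arrival rate times time in the system. The sketch below applies it to the 2 ms and 5 s latencies above. The request rate and pool size are assumed numbers chosen for illustration, and only the cache-lookup time is counted.

```python
def in_flight(arrival_rate_rps: float, latency_s: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return arrival_rate_rps * latency_s


ARRIVAL_RATE_RPS = 1_000  # assumption: per-instance load, not from the design
POOL_SIZE = 200           # assumption: worker threads per Feed Service instance

for label, latency_s in (("healthy Redis, 2 ms", 0.002), ("slow Redis, 5 s", 5.0)):
    needed = in_flight(ARRIVAL_RATE_RPS, latency_s)
    status = "fits" if needed <= POOL_SIZE else "pool exhausted"
    print(f"{label}: ~{needed:.0f} threads needed of {POOL_SIZE} -> {status}")
```

Under these assumptions the same traffic needs thousands of times more concurrency when the dependency slows down, which is why no realistic pool size absorbs it.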
Remediations (every one must be present):
- Timeout budget. Feed Service Redis timeout = 50 ms, not 5 s. If Redis is slow, fail fast.
- Bulkhead. Redis client has a bounded connection pool per-endpoint. If one pool saturates, others do not.
- Circuit breaker. After N consecutive Redis timeouts, open the circuit for 30 s and serve the cache-miss path directly.
- Backpressure on callers. Return 429/503 with a retry-after header when the Feed Service itself is saturated; do not quietly absorb.
- Retry with jitter. If we retry at all, use exponential backoff plus random jitter -- the Amazon Builders' Library pattern -- to avoid synchronized retry storms.
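The circuit-breaker item in the list above can be sketched as a small class. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, the threshold of 5 stands in for the unspecified N, and the clock is injectable only so the behavior can be tested without waiting 30 s.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive timeouts; stays open for `open_s` seconds."""

    def __init__(self, threshold: int = 5, open_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.open_s = open_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.open_s:
            # Cool-down elapsed: let calls through again, but keep the count one
            # short of the threshold so a single new timeout re-opens the circuit.
            self.opened_at = None
            self.failures = self.threshold - 1
            return False
        return True

    def call(self, primary: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.is_open():
            return fallback()  # skip the slow dependency entirely
        try:
            result = primary()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # any success resets the consecutive count
        return result
```

In the feed example, `primary` would be the Redis lookup (with its 50 ms timeout) and `fallback` the cache-miss path. The fallback only helps if the DB is sized to take that load.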
Each of these is a line on the diagram's per-box annotation. Missing any one can turn a 5 s Redis blip into a 10 min incident. Slow failures tend to cause more outages than hard failures because the system keeps trying instead of giving up.
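"Keeps trying" is only safe when retries are bounded and desynchronized. Below is a minimal sketch of capped exponential backoff with random jitter, using the variant often called "full jitter" (a uniform delay between zero and the current backoff ceiling). The function names, base delay, and cap are assumptions for illustration, and `sleep` is injectable only for testing.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0) -> float:
    """Full jitter: uniform random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def retry(op: Callable[[], T], max_attempts: int = 3,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `op` on TimeoutError. Only safe when `op` is idempotent."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts - 1):
        try:
            return op()
        except TimeoutError:
            sleep(backoff_delay(attempt))  # wait a random, growing interval
    return op()  # final attempt: let any TimeoutError propagate to the caller
```

The attempt limit bounds the extra load a failing dependency receives, and the jitter keeps many clients from retrying at the same instant.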
Common Confusion / Misconceptions
"We have replicas, so we are fine." Replicas protect against single-machine failure, not against network partitions, not against software bugs that kill every instance, not against failed failovers, not against cascade from a cold cache.
"Timeouts save us." Timeouts that are too long lock threads; timeouts that are too short produce retry storms. Timeouts must be budgeted alongside the P99 target and must get smaller as you go deeper into the call stack.
"Retries make it safer." Retries without backoff and jitter amplify failure into an outage. Retries without idempotency corrupt data. "Retry on failure" is a decision, not a default.
"Graceful degradation is always possible." It is not. Some operations (money transfer, reservation, authentication) cannot degrade gracefully. Be honest about which parts of the system are degradable and which are "fail loudly and stop".
"The runbook is the plan." A runbook is a fallback when automation fails. Systems that require a runbook for every incident have baked operational cost into the design -- legitimately, sometimes, but you have to say so.
"Hard failures are the problem." Gray failures -- slow, partial, or asymmetric -- cause more production incidents than outright crashes. Always walk the slow-arrow case in addition to the dead-box case.
How To Use It
The failure-walk protocol. For each diagram:
- List every box.
- For each box, fill the four-column table: impact, blast radius, recovery, TTR.
- Repeat for each arrow (slow path and dead path).
- Repeat at AZ granularity; at region granularity.
- Identify every cell where TTR is "manual" or "hours" -- those are the design's load-bearing operational assumptions.
- For the top 2-3 worst cells, propose a concrete remediation or accept the cost in writing.
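The protocol above can also be kept as data, so the "manual or hours" cells can be filtered out mechanically. This is a sketch under assumptions: the `FailureRow` type and `load_bearing` helper are hypothetical, the rows are abridged from the social feed table, and marking the all-shards failover as non-automatic is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailureRow:
    box: str
    impact: str
    blast_radius: str
    recovery: str
    ttr_seconds: float  # worst-case estimate
    automatic: bool     # False means a human has to act


def load_bearing(rows: List[FailureRow], slow_s: float = 3600.0) -> List[FailureRow]:
    """Rows whose recovery is manual or takes an hour or more, worst first."""
    flagged = [r for r in rows if not r.automatic or r.ttr_seconds >= slow_s]
    return sorted(flagged, key=lambda r: r.ttr_seconds, reverse=True)


rows = [
    FailureRow("Feed Service instance", "near zero", "instance",
               "LB ejects on health", 30, True),
    FailureRow("DB primary shard", "writes to shard fail", "users on that shard",
               "automatic failover", 120, True),
    FailureRow("DB primaries, all shards", "writes fail globally", "everyone",
               "regional failover", 1800, False),  # assumed manual
]

for row in load_bearing(rows):
    print(f"{row.box}: TTR ~{row.ttr_seconds:.0f} s, automatic={row.automatic}")
```

The flagged rows are the ones that need either a concrete remediation or a written acceptance of the cost.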
Transfer / Where This Shows Up Later
- Cluster 4 concept 12 (SPOFs) consolidates the worst cells from the failure table into a ranked SPOF list.
- Cluster 5 concept 15 (design doc) has a dedicated failure-walk section -- this is literally what goes there.
- S8M4 (scale/reliability/performance) runs these scenarios as chaos experiments in production (GameDays, automated fault injection).
- S9 (cloud + DevOps) implements the remediations -- circuit breakers (Envoy/Istio), health checks (ALB, NLB), timeouts, runbooks, DR plans.
- S10 capstone/interviews: "what happens if X dies?" is the single most common stress-test question. Candidates who answer it as a prepared table rather than improvisation signal seniority.
Check Yourself
- For your URL shortener design, what happens if the cache cluster goes dark? Can the DB survive full origin load?
- What is the difference between a crash-only failure (process dies) and a slow failure (process hangs)? Which is harder to handle?
- Why is a 300 ms timeout with 3 retries potentially worse than a 1 s timeout with 1 retry?
- Name the five mandatory defenses against a slow-arrow cascade. In what order would you deploy them?
Mini Drill or Application
For any two of your Cluster 2 designs, complete the full failure-walk table. For each:
- Kill every box.
- Kill every arrow (time out the call).
- Kill an entire AZ.
- Identify the worst cell and propose one remediation.
- Write one sentence naming the operational assumption that, if violated, causes the system to degrade.
Read This Only If Stuck
- System Design Primer: Availability patterns -- fail-over classes and their TTR.
- System Design Primer: CAP theorem -- what a partition does to your consistency/availability choice.
- System Design Primer: Load balancer -- health checks, ejection, anycast failover.
- System Design Primer: Database RDBMS replication -- replica-based failover semantics.
- System Design Primer: Asynchronism -- queues absorb the slow-downstream failure mode.
- Fundamentals: Preventing data loss (EDA) -- durability-under-failure patterns for event-driven flows.
- Fundamentals: Architecture characteristics ratings (reliability, fault tolerance) -- how different styles score on failure dimensions.
- Principles of Chaos Engineering (principlesofchaos.org) -- the canonical statement on treating failure as a control variable.
- Amazon Builders' Library -- Timeouts, retries, and backoff with jitter -- operational discipline for the slow-arrow case.
- AWS Well-Architected Framework -- Reliability Pillar -- structured checklist for failure-mode coverage.
- Google Cloud Well-Architected -- Reliability -- complementary view emphasizing graceful degradation.