Reason About Failure: What Happens When X Dies?

What This Concept Is

Failure reasoning is the habit of walking your diagram and asking, for every single box and arrow, what happens when this dies? Not "if"; the component will die. The useful question is what the user sees, whether the system recovers, and how long recovery takes.

For each box you kill, there are four things to describe:

  1. Immediate user impact: errors, timeouts, degraded responses, no impact.
  2. Blast radius: which other components are affected, and for how long.
  3. Recovery mechanism: automatic failover, health-check ejection, retry, circuit breaker, or manual runbook.
  4. Time to recover (TTR): seconds (in-process retry), minutes (health-check failover), hours (manual intervention).

For each arrow you interrupt, ask the same four questions, once for a dead arrow and once for a slow one. Slow arrows are often worse than dead ones because callers keep holding threads, sockets, and memory while they wait.

A mature framing from the Principles of Chaos Engineering and the AWS Builders' Library: failure is a control variable, not a surprise. You introduce specific failure modes (hard kill, slow response, partial outage, network partition, clock skew) and compare steady-state behavior before and after. The goal of the paper-walk here is to anticipate each of those experiments before they run -- to move failure modes from "unexpected bug" to "documented, bounded, survivable".

Why It Matters Here

"Highly available" is a claim; failure reasoning is the proof.

  • A design with a hidden single point of failure (SPOF) can be 99% available and feel invincible, until the SPOF fails and the whole thing goes dark.
  • A design with thoughtful failure modes degrades gracefully: read-only mode, stale cache, reduced features.
  • Reviewers often test exactly this: "what happens if the primary DB goes down? what if only the replica fails? what if the network between them partitions?"

You cannot improvise this under pressure. You either walked through it already, or you are guessing.

Concrete Example: Social Feed Failure Walk

For the social feed design (abridged), kill each box in turn:

Box killed | User impact | Blast radius | Recovery | TTR
CDN PoP | Slight latency rise for static assets in that region | Region only | Client DNS/anycast reroutes | Seconds
Global LB | Regional DNS failover required | All users in that region until DNS TTL expires | DNS/anycast reroute | 30 s - 5 min
Regional LB | 5-30 s blip while health checks shift traffic | That region only | Active-active with peer LBs | Seconds
Feed Service instance | Zero; another instance serves the request | None (instance-level) | LB ejects on health | 10-30 s
Redis cache (single node in cluster) | Slight P99 rise; cold-cache reads hit DB | Temporarily higher DB load | Client-side consistent hashing rebalances | Minutes to warm
Redis cluster (entire cache tier) | Heavy DB load; origin may tip over | Whole region | Keep DB sized to survive; circuit-break non-critical | Until cache restored
DB primary shard | Writes to that shard fail | Users whose user_id hashes to that shard | Automatic failover to replica; promote | 30 s - 2 min
DB primary all shards | Writes globally fail | Everyone | Regional failover | 5-30 min
Kafka topic | Derived views (search, analytics) stop updating | Non-critical; user-visible degraded features | Kafka replica takes over; consumer catches up | Minutes
Fan-out worker | Timeline updates backlogged; new posts visible late | Non-critical | Queue processes on restart | Minutes
Follow-graph service | Fan-out stops; posts land in author-only view | Moderate; read-fanout can degrade | Service restart or replica | Minutes

Notice what falls out of this exercise:

  • The Redis cluster is effectively a soft SPOF: a total failure there causes cascading origin overload. Remediation: size the DB to survive cache-cold; add circuit breakers that shed non-critical reads.
  • Fan-out being behind Kafka makes it naturally degradable: slowness is visible only as stale timelines, not errors.
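  • DB primary failover determines your availability budget. With one failure per month, 30 s of failover costs about 0.001% of a 30-day month (~99.999% from this cause alone); 2 minutes costs about 0.005% (~99.995%). For comparison, a 99.99% target allows roughly 4.3 minutes of total downtime per month.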
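
A quick way to sanity-check failover budgets is to convert downtime per month into an availability percentage, and an availability target back into allowed downtime. The sketch below assumes a 30-day month and that failover is the only source of downtime; the function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def availability_pct(downtime_s_per_failure, failures_per_month=1.0):
    """Availability in percent if this failover is the only downtime."""
    downtime_s = downtime_s_per_failure * failures_per_month
    return 100.0 * (1.0 - downtime_s / SECONDS_PER_MONTH)


def downtime_budget_s(target_pct):
    """Seconds of downtime per month that an availability target allows."""
    return SECONDS_PER_MONTH * (1.0 - target_pct / 100.0)


if __name__ == "__main__":
    for failover_s in (30, 120):
        print(f"{failover_s:>4} s failover -> {availability_pct(failover_s):.4f}%")
    for target in (99.95, 99.99, 99.999):
        print(f"{target}% target -> {downtime_budget_s(target):.0f} s/month")
```

Running the numbers both ways shows how much of the total budget a single failover consumes, which is what decides whether automatic failover is fast enough for the target you are claiming.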

For an AZ outage, walk the same table at the AZ level: is every tier cross-AZ? Can it lose one AZ and still serve writes?

For a region outage, walk again: are you active-active, active-passive, or cold-DR?

Concrete Example 2: Slow Arrow vs Dead Arrow

Instead of killing the Redis cluster, make it respond in 5 s rather than 2 ms:

  • Every Feed Service request blocks on the cache lookup. Threads back up. The connection pool exhausts.
  • Health checks on the Feed Service begin failing because new connections time out waiting for a thread.
  • The LB ejects Feed Service instances as unhealthy. The remaining instances receive more traffic and exhaust their pools faster. This is a cascade.
  • Net result: from the user's perspective, one slow component has taken the entire hot path offline, even though no component has technically died.
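
The thread exhaustion above follows from Little's Law: average requests in flight equal arrival rate times time in the system. A rough sketch, assuming a hypothetical 1,000 requests/s per instance and a 200-thread pool (neither number comes from the design), and counting only the cache lookup as latency:

```python
def in_flight_requests(arrival_rate_rps, latency_s):
    """Little's Law: average requests in flight = arrival rate x time in system."""
    return arrival_rate_rps * latency_s


RATE_RPS = 1_000   # hypothetical per-instance request rate
POOL_SIZE = 200    # hypothetical worker-thread pool size

healthy = in_flight_requests(RATE_RPS, 0.002)  # 2 ms cache lookup
slow = in_flight_requests(RATE_RPS, 5.0)       # 5 s cache lookup

print(f"healthy: {healthy:.0f} threads busy of {POOL_SIZE}")
print(f"slow:    {slow:.0f} threads needed, pool is {POOL_SIZE}")
print("pool exhausted" if slow > POOL_SIZE else "pool ok")
```

Under these assumptions the same traffic that keeps a couple of threads busy at 2 ms would need thousands at 5 s, so the pool saturates long before any component is technically down.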

Remediations (every one must be present):

  1. Timeout budget. Feed Service Redis timeout = 50 ms, not 5 s. If Redis is slow, fail fast.
  2. Bulkhead. Redis client has a bounded connection pool per-endpoint. If one pool saturates, others do not.
  3. Circuit breaker. After N consecutive Redis timeouts, open the circuit for 30 s and serve cache-miss path directly.
  4. Backpressure on callers. Return 429/503 with a retry-after header when the Feed Service itself is saturated; do not quietly absorb.
  5. Retry with jitter. If we retry at all, exponential backoff + random jitter -- the AWS Builders' Library pattern -- to avoid synchronized retry storms.
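
Remediation 3 can be sketched as a small consecutive-failure circuit breaker. This is a minimal illustration, not a production implementation: the class and parameter names are hypothetical, it is not thread-safe, and after the cool-down it allows a single probe call rather than implementing a full half-open state.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive timeouts, stays open for `open_s` seconds."""

    def __init__(self, threshold=5, open_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.open_s = open_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.open_s:
            # Cool-down elapsed: close, but leave the counter one short of the
            # threshold so a single failed probe re-opens the circuit.
            self.opened_at = None
            self.failures = self.threshold - 1
            return False
        return True

    def call(self, primary, fallback):
        """Run primary(); on timeout or open circuit, serve fallback() instead."""
        if self.is_open():
            return fallback()       # skip the slow dependency entirely
        try:
            result = primary()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # any success resets the streak
        return result
```

In the feed example, `primary` would be the Redis lookup (with its 50 ms timeout) and `fallback` the cache-miss path; while the circuit is open, requests stop waiting on Redis at all.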

Each of these belongs in the diagram's per-box annotations. Missing any one can turn a 5 s Redis blip into a 10 min incident. Slow failures cause more outages than hard failures because the system keeps trying instead of giving up.
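
Remediations 1 and 5 interact: retries have to fit inside the caller's overall timeout budget. The sketch below is one way to combine a per-attempt timeout, an overall deadline, and exponential backoff with full jitter. The 50 ms per-attempt cap comes from the list above; the 300 ms overall budget, the backoff constants, and the function name are assumptions for illustration.

```python
import random
import time


def call_with_retries(op, budget_s=0.3, per_try_cap_s=0.05, retries=2,
                      base_s=0.01, cap_s=0.1,
                      clock=time.monotonic, sleep=time.sleep, rng=random.random):
    """Call op(timeout_s) with jittered retries, never exceeding budget_s overall.

    `op` must be idempotent: a timed-out attempt may still have taken effect.
    """
    deadline = clock() + budget_s
    for attempt in range(retries + 1):
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("budget exhausted")
        try:
            # Each attempt gets the smaller of its own cap and what is left overall.
            return op(min(per_try_cap_s, remaining))
        except TimeoutError:
            if attempt == retries:
                raise               # out of retries: fail loudly to the caller
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
            delay = rng() * min(cap_s, base_s * 2 ** attempt)
            # Never sleep past the deadline.
            sleep(min(delay, max(0.0, deadline - clock())))
```

Because the deadline is fixed when the call starts, a slow dependency cannot stretch the request beyond the budget no matter how many retries are configured, and the random delay keeps many clients from retrying in lockstep.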

Common Confusion / Misconceptions

"We have replicas, so we are fine." Replicas protect against single-machine failure, not against network partitions, not against software bugs that kill every instance, not against failed failovers, not against cascade from a cold cache.

"Timeouts save us." Timeouts that are too long lock threads; timeouts that are too short produce retry storms. Timeouts must be budgeted alongside the P99 target and must get smaller as you go deeper into the call stack.

"Retries make it safer." Retries without backoff and jitter amplify failure into an outage. Retries without idempotency corrupt data. "Retry on failure" is a decision, not a default.

"Graceful degradation is always possible." It is not. Some operations (money transfer, reservation, authentication) cannot degrade gracefully. Be honest about which parts of the system are degradable and which are "fail loudly and stop".

"The runbook is the plan." A runbook is a fallback when automation fails. Systems that require a runbook for every incident have baked operational cost into the design -- legitimately, sometimes, but you have to say so.

"Hard failures are the problem." Gray failures -- slow, partial, or asymmetric -- cause more production incidents than outright crashes. Always walk the slow-arrow case in addition to the dead-box case.

How To Use It

The failure-walk protocol. For each diagram:

  1. List every box.
  2. For each box, fill the four-column table: impact, blast radius, recovery, TTR.
  3. Repeat for each arrow (slow path and dead path).
  4. Repeat at AZ granularity; at region granularity.
  5. Identify every cell where TTR is "manual" or "hours" -- those are the design's load-bearing operational assumptions.
  6. For the top 2-3 worst cells, propose a concrete remediation or accept the cost in writing.
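
If the table is kept as data rather than prose, step 5 becomes a filter. A minimal sketch, assuming TTR is recorded as one of four coarse buckets; the `FailureRow` type, the bucket names, and the two rows marked hypothetical are illustrative and not part of the social feed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRow:
    target: str        # the box or arrow being killed
    impact: str
    blast_radius: str
    recovery: str
    ttr: str           # one of: "seconds", "minutes", "hours", "manual"


SEVERITY = {"seconds": 0, "minutes": 1, "hours": 2, "manual": 3}


def load_bearing(rows):
    """Rows whose TTR is hours or manual, worst first."""
    risky = [r for r in rows if SEVERITY[r.ttr] >= SEVERITY["hours"]]
    return sorted(risky, key=lambda r: SEVERITY[r.ttr], reverse=True)


rows = [
    FailureRow("Feed Service instance", "none", "instance",
               "LB ejects on health check", "seconds"),
    FailureRow("DB primary shard", "writes to shard fail", "users on that shard",
               "automatic failover", "minutes"),
    FailureRow("Search index (hypothetical)", "stale results", "search only",
               "full rebuild", "hours"),
    FailureRow("Config store (hypothetical)", "deploys blocked", "all services",
               "restore from backup", "manual"),
]

for row in load_bearing(rows):
    print(f"{row.target}: {row.recovery} ({row.ttr})")
```

The rows this returns are the candidates for step 6: each one needs either a concrete remediation or a written acceptance of the cost.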

Transfer / Where This Shows Up Later

  • Cluster 4 concept 12 (SPOFs) consolidates the worst cells from the failure table into a ranked SPOF list.
  • Cluster 5 concept 15 (design doc) has a dedicated failure-walk section -- this is literally what goes there.
  • S8M4 (scale/reliability/performance) runs these scenarios as chaos experiments in production (GameDays, automated fault injection).
  • S9 (cloud + DevOps) implements the remediations -- circuit breakers (Envoy/Istio), health checks (ALB, NLB), timeouts, runbooks, DR plans.
  • S10 capstone/interviews: "what happens if X dies?" is the single most common stress-test question. Candidates who answer it as a prepared table rather than improvisation signal seniority.

Check Yourself

  1. For your URL shortener design, what happens if the cache cluster goes dark? Can the DB survive full origin load?
  2. What is the difference between a crash-only failure (process dies) and a slow failure (process hangs)? Which is harder to handle?
  3. Why is a 300 ms timeout with 3 retries potentially worse than a 1 s timeout with 1 retry?
  4. Name the defenses against a slow-arrow cascade listed above. In what order would you deploy them?

Mini Drill or Application

For any two of your Cluster 2 designs, complete the full failure-walk table. For each:

  1. Kill every box.
  2. Kill every arrow (time out the call).
  3. Kill an entire AZ.
  4. Identify the worst cell and propose one remediation.
  5. Write one sentence naming the operational assumption that, if violated, causes the system to degrade.

Read This Only If Stuck