Failure Modes: Cascading, Correlated, and Gray
What This Concept Is
Three failure patterns defeat naive "just add redundancy" thinking. Every production outage retrospective eventually names one.
- Cascading failure: one component's failure increases load on, or corrupts input to, a neighbor, which then fails, which then takes down its neighbor. The blast radius grows instead of stopping. Example primitives: thread-pool exhaustion, retry storms, connection-pool exhaustion, unbounded queue backlog.
- Correlated failure: multiple "independent" replicas fail at the same time because they were not independent after all. A shared power supply, a shared config push, a shared software bug, a shared DNS server, a shared certificate expiry, a shared deploy pipeline. The independence assumption in availability math is almost always partially false.
- Gray failure: the component appears healthy to monitoring and to most callers but is silently degraded for some users or some operations. "The dashboard is green but users are paging us" is the gray-failure tagline. The defining property is differential observability - internal observers see health; external observers see failure.
Richard Cook's How Complex Systems Fail gives the structural explanation: complex systems run in a constant state of slight disrepair, and catastrophe requires multiple small failures to line up. Redundancy alone cannot save a system whose "redundant" components share a hidden dependency or whose health checks cannot observe partial degradation.
Why It Matters Here
Most production incidents are not "one thing broke." They are one of these three patterns. Knowing the pattern tells you the mitigation:
- Cascading -> bulkheads, circuit breakers, timeouts, back-pressure, queue limits.
- Correlated -> failure domain isolation, blast-radius accounting, staged rollouts, diverse dependencies.
- Gray -> client-side signals, end-to-end probes, user-path monitoring, hedged requests.
Availability math that assumes independent failures (P(both fail) = P_A * P_B) is systematically optimistic. The System Design Primer's "parallel" availability formula assumes independence; in real systems that assumption almost never holds completely.
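The two textbook formulas can be sketched in a few lines. This is a minimal illustration, not the Primer's own code: the function names are made up for this note, and both functions bake in the independence assumption discussed above.

```python
from math import prod

def serial_availability(parts):
    """Components in series: all must be up, so multiply availabilities."""
    return prod(parts)

def parallel_availability(parts):
    """Redundant replicas: the system is down only if every replica is down.
    Only valid when replica failures are statistically independent."""
    return 1 - prod(1 - a for a in parts)

print(f"{serial_availability([0.999, 0.999]):.6f}")    # 0.998001
print(f"{parallel_availability([0.999, 0.999]):.6f}")  # 0.999999
```

The parallel number is the one to distrust: any shared failure domain pulls the real figure back toward a single replica's 99.9%.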
Naming the pattern is also how you calibrate your response. "It's cascading, the circuit breaker should fire" is a different sentence than "it's correlated, we need to stop the config push." Without the vocabulary, both sound like "the service is broken" and get the same, usually wrong, response.
Concrete Example
Cascading (real numbers): Service A calls Service B. B is running a slow deploy; its p99 climbs from 100ms to 5s. A runs 100 threads and serves 500 req/s with 50ms p99. If calls to B now hold a thread for about 5s, Little's Law (L = lambda * W) puts A's required concurrency at 500 * 5 = 2,500 in-flight requests - 25 times more than A has threads for. A's thread pool saturates within seconds. A now cannot serve anyone - even requests that do not touch B. The outage spreads from B to A. Add Service C, which depends on A; now C is down too. Fix: A should have a timeout (200ms) on calls to B and a circuit breaker that opens after 5 consecutive failures in a 10-second window, stays open for 30s, then probes. A should also have a bulkhead reserving a bounded thread pool (e.g., 20 threads) for B-dependent work so unrelated work keeps flowing.
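The circuit breaker described in the fix can be sketched as follows. This is a minimal single-threaded illustration, assuming the thresholds from the example (5 consecutive failures within 10s, 30s open, then one probe); the class and exception names are hypothetical, the 200ms timeout is assumed to be enforced inside the wrapped call, and a production breaker would also need locking and a limit on concurrent probes.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, threshold=5, window_s=10.0, open_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.open_s = open_s
        self.clock = clock          # injectable so tests do not have to sleep
        self.failures = []          # timestamps of the current run of consecutive failures
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_s:
                raise CircuitOpenError("breaker open; dependency call skipped")
            # Open period has elapsed: let this call through as the probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure(self.clock())
            raise
        # Any success ends the failure run and closes the breaker.
        self.failures.clear()
        self.opened_at = None
        return result

    def _record_failure(self, now):
        if self.opened_at is not None:
            self.opened_at = now    # probe failed: stay open for another full period
            return
        # Keep only failures inside the window, then add this one.
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

While the breaker is open, A fails fast instead of parking a thread on B, which is what keeps the thread pool available for work that does not touch B.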
Correlated (real numbers): Three database replicas are deployed in three AZs for "independence." The math says P(all three fail simultaneously) = (0.001)^3 = 10^-9; the SLO calculation treats this as something that will not happen in 100 years. A bad config push (from a single control-plane service) hits all three replicas simultaneously and restarts them; P(config push is bad) = 0.01 and the push hit 100% of replicas, so the actual probability of correlated failure was ~0.01, seven orders of magnitude worse than the independence calculation. The independence assumption was false; the shared control plane was the hidden correlator. Fix: stage config pushes across AZs with at least 15 minutes between AZ rollouts, require explicit canary at 1-5% of each AZ, and pin deploys to one replica at a time within an AZ. Google calls this "deployment isolation," AWS calls it "staged rollout with AZ boundaries." Same idea.
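The gap between the two numbers can be checked with a small common-cause model. This is a sketch under a simplifying assumption: a bad push takes down every replica with certainty, and otherwise replicas fail independently. The function name is illustrative.

```python
def p_all_replicas_down(p_replica, n, p_common_cause):
    """P(all n replicas down) when one shared event can take them all out.

    Assumption: the common-cause event downs every replica; absent that
    event, replicas fail independently with probability p_replica each.
    """
    independent = p_replica ** n
    return p_common_cause + (1 - p_common_cause) * independent

naive = 0.001 ** 3                               # independence-only estimate
actual = p_all_replicas_down(0.001, 3, 0.01)     # with the shared config push
print(f"naive={naive:.1e} actual={actual:.1e} ratio={actual / naive:.1e}")
```

The common-cause term dominates completely: adding a fourth or fifth replica changes the independent term but leaves the result at roughly 0.01.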
Gray (real scenario): A load balancer in front of five backends serving 10,000 req/s starts silently dropping 1% of requests because its connection tracker overflowed at ~65k concurrent flows. Health checks (synthetic GET /health every 10s) pass; error-rate metric hovers at ~1.2% - which is below the 2% alert threshold. The support queue fills with "I pressed submit and nothing happened" tickets from ~100 req/s of real user actions. Engineering blames the client; the client blames engineering. For 3 hours, the gap between internal metrics (1.2% errors) and user experience (form submit sometimes fails) is invisible in the dashboards. Fix: end-to-end synthetic probes from a user-geography vantage point that exercise the full login-to-submit flow every minute; client-side error reporting (Sentry, DataDog RUM) on failed requests; alert on sustained low-level error rate above baseline (the 2% threshold was for known bad; the alert should fire on "higher than last week").
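The "higher than last week" alert can be sketched as a comparison against a rolling baseline. This is an illustration only: the function name, the 3x ratio, and the 0.2% baseline are assumptions chosen for the example, not values from the incident.

```python
from statistics import mean

def sustained_above_baseline(recent, last_week, ratio=3.0):
    """True if every recent error-rate sample is at least `ratio` times
    the mean of last week's samples. Requiring all samples to exceed the
    bar is a crude way to express "sustained" rather than a single spike."""
    baseline = mean(last_week)
    return bool(recent) and all(r >= ratio * baseline for r in recent)

last_week = [0.002] * 7     # hypothetical: ~0.2% errors on a normal day
recent = [0.012] * 5        # the ~1.2% rate from the scenario, five samples in a row

print(sustained_above_baseline(recent, last_week))   # True
print(all(r >= 0.02 for r in recent))                # False: the static 2% alert stays silent
```

A static threshold answers "is this known-bad?"; the baseline comparison answers "is this different from normal?", which is the question a gray failure needs.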
Common Confusion / Misconception
"Our availability is 99.9 * 99.9 = 99.8%." Only if failures are independent. In the correlated case above, you have 99.9% period - plus a non-trivial tail of simultaneous-outage events the multiplication does not capture. Werner Vogels says it cleanly: "everything fails, all the time." Treat independence as a claim that requires evidence, not a default.
"Our dashboards are green, so we are healthy." Gray failures exist precisely because the dashboard is computed from the same vantage point as the failure is hiding in. The gray-failure research paper (Microsoft, 2017) defines the pattern as a differential observability problem: internal observers see health, external observers see failure. The fix is not "better dashboards"; it is "observers on the outside."
"Retries make us more reliable." Retries during cascading failure are the classic amplifier: a struggling service gets hit three times as hard by well-intentioned clients. The mitigation is exponential backoff with jitter - but the better mitigation is a circuit breaker that stops retrying when the dependency is clearly down. AWS's Builders' Library piece "Timeouts, retries, and backoff with jitter" calls out retry storms as one of the top-three cloud failure amplifiers; naive retry loops are worse than no retry at all.
"We add redundancy, so we are fault-tolerant." Redundancy only tolerates the faults it was designed for. Three AZs help against AZ power loss but not against a shared deploy pipeline, a shared config service, a shared cert authority, or a software bug that hits every replica identically. The question is not "are we redundant" but "what set of faults are we redundant against."
"A rare failure is OK to ignore." Low-probability failures compound with high-cost consequences. A correlated outage once per year that takes the service down for 4 hours costs you 4 hours / 8760 hours = 0.046% availability - which alone puts a 99.95% SLO in jeopardy.
How To Use It
Before approving any design:
- Draw the dependency graph. For every edge, ask: what happens if the callee slows down or fails? If the answer is "we also fail," cascading is live - add timeouts, circuit breakers, bulkheads.
- List the shared failure domains. Same AZ, same rack, same control plane, same DNS, same certificate authority, same config pipeline. Each shared domain is a correlated-failure vector. Reduce or stage rollouts.
- Ask where gray failures can hide. Can a request succeed at the LB and fail at the backend without anyone seeing? Can a partial connection pool exhaustion mask as p99? Instrument from both sides.
- Measure from the user's side. If your only signals come from the same machines that can be sick, you cannot detect a gray failure.
- Set timeouts shorter than the caller's patience. A timeout longer than the caller will wait is useless; it just burns resources. The rule of thumb: each hop's timeout should be ~2-3x its p99, and timeouts must shrink with depth, so that every hop's timeout fits inside its caller's remaining budget.
- Prefer fail-fast over fail-silent. A 503 with clear semantics is recoverable; a hung request is not. Every error path should have an observable terminal state.
- Make the bulkhead real. "We use separate thread pools" means a pool per dependency, with a visible metric per pool. "Our framework does it" is not a bulkhead if you cannot point to the metric.
- Rehearse the mitigation. Circuit breakers, fallbacks, and failovers you have not executed in the last 90 days have a ~50% chance of not working when you need them. Schedule a regular drill.
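To make "a pool per dependency, with a visible metric per pool" concrete, here is a minimal bulkhead sketch built on a semaphore. The class, exception, and counter names are hypothetical; a real service would export `in_use` and `rejected` to its metrics system rather than leave them as plain attributes.

```python
import threading

class BulkheadFullError(Exception):
    """Raised when the dependency's concurrency budget is already spent."""

class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self.max_concurrent = max_concurrent
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.in_use = 0       # the per-pool metric you should be able to point to
        self.rejected = 0     # calls shed because the pool was full

    def run(self, fn, *args, **kwargs):
        # Fail fast instead of queueing: a full bulkhead rejects immediately.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            raise BulkheadFullError(f"{self.name}: {self.max_concurrent} calls already in flight")
        with self._lock:
            self.in_use += 1
        try:
            return fn(*args, **kwargs)
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()
```

Usage would be one instance per dependency, for example `Bulkhead("service-b", max_concurrent=20)`, so a slow callee can occupy at most its own slots.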
Check Yourself
- Name one mechanism that prevents cascading failure on a request path and one that prevents it on an async/queue path.
- A team deploys to three "independent" AZs with a shared config service. What class of failure are they exposed to, and what is the mitigation?
- Why does "low error rate from the server" not disprove a gray failure?
- Given a caller with a 1s patience budget and a 5-hop call graph, what is a reasonable timeout for each hop?
- Why does an unbounded retry storm amplify a cascading failure, and what is the single most impactful mitigation?
- Your 3-replica database claims 99.999% availability from independence math. Name three shared failure domains that invalidate the math.
- Give one example of a gray failure you have personally experienced (or read about in a postmortem), and identify the differential-observability gap.
Mini Drill or Application
Take a real or hypothetical service. For each of cascading, correlated, and gray failure, write: "Our exposure is _", "Our current mitigation is _", "Our residual risk is _". Blank "mitigation" cells are your next reliability ticket.
Extension drill: correlated-failure archaeology. Pick your single most-critical dependency graph (up to ~8 services). For each pair of services, ask: "do these share any of {power, network, deploy pipeline, config service, certificate authority, DNS, metrics-collection, log-aggregator, identity service, shared database, shared cache, shared auth library}?" Count shared domains. Any pair sharing 3+ domains is a correlated-failure time bomb. The usual answer is "most of our 'independent' services share at least 5 domains" - which is fine as long as you have named the risk and paid for the mitigation (staged rollouts, canaries, replicated controllers, etc).
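The pairwise count in this drill is easy to automate once the inventory exists. The sketch below assumes you have already written down, by hand, which failure domains each service depends on; the service names and domain sets are invented for illustration.

```python
from itertools import combinations

def shared_domains(inventory):
    """Map each service pair to the set of failure domains both depend on."""
    return {
        (a, b): inventory[a] & inventory[b]
        for a, b in combinations(sorted(inventory), 2)
    }

def time_bombs(inventory, threshold=3):
    """Pairs sharing at least `threshold` domains (3+ per the drill)."""
    return {pair: d for pair, d in shared_domains(inventory).items()
            if len(d) >= threshold}

# Hypothetical inventory: service -> failure domains it depends on.
inventory = {
    "orders":   {"deploy pipeline", "config service", "DNS", "identity service"},
    "payments": {"deploy pipeline", "config service", "DNS", "shared database"},
    "search":   {"DNS", "shared cache"},
}

for (a, b), domains in time_bombs(inventory).items():
    print(f"{a} / {b}: {len(domains)} shared -> {sorted(domains)}")
```

The hard part is the inventory, not the code: the script only surfaces what someone was honest enough to write down.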
Extension drill: cascading-path mapping. Take your dependency graph. For each outbound edge from a service, write down: (1) the timeout configured, (2) whether a circuit breaker exists, (3) whether a bulkhead exists, (4) what happens if the callee returns 5xx for 60 seconds. Most teams find that 70%+ of their edges fail at step 4 with "we stop serving requests that do not even touch this callee." That is the size of their cascading-failure attack surface, expressed as a count.
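The four questions map naturally onto a record per edge. The following is a sketch of how the audit could be captured and counted; the `Edge` fields mirror steps (1) to (4) above, and the example services and values are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Edge:
    caller: str
    callee: str
    timeout_ms: Optional[int]          # (1) None means no timeout is configured
    circuit_breaker: bool              # (2)
    bulkhead: bool                     # (3)
    unrelated_traffic_survives: bool   # (4) outcome if the callee returns 5xx for 60s

def missing_controls(edge: Edge) -> List[str]:
    gaps = []
    if edge.timeout_ms is None:
        gaps.append("timeout")
    if not edge.circuit_breaker:
        gaps.append("circuit breaker")
    if not edge.bulkhead:
        gaps.append("bulkhead")
    return gaps

def attack_surface(edges: List[Edge]) -> List[Edge]:
    """Edges where a failing callee also stops unrelated requests."""
    return [e for e in edges if not e.unrelated_traffic_survives]

# Hypothetical audit results.
edges = [
    Edge("checkout", "payments", 200, True, True, True),
    Edge("checkout", "recommendations", None, False, False, False),
    Edge("search", "catalog", 500, False, False, False),
]

exposed = attack_surface(edges)
print(f"{len(exposed)} of {len(edges)} edges can take down unrelated traffic")
for e in exposed:
    print(f"{e.caller} -> {e.callee}: missing {', '.join(missing_controls(e))}")
```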
Metastability, a bonus failure mode. Not among the three above, but closely related: a metastable failure is when the system falls into a persistent bad state that remains even after the trigger goes away. Classic example: retry storms push a service over capacity; even after new arrivals stop, the queue of retries keeps it saturated. Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu's Metastable Failures in Distributed Systems paper (HotOS 2021) is the canonical reference. Treat it as a fourth category when a post-incident review uncovers "it fixed itself, then broke again when we restored traffic."
Gray-failure detection arithmetic. The cheapest viable gray-failure detector is a synthetic probe hitting the real user path from outside your infrastructure. Budget: one probe per minute per region per critical journey. 3 regions * 4 journeys * 1440 minutes/day = 17,280 probes per day, which is trivially cheap. Compare against the alternative: a 15-minute gray-failure window at 10k RPS means 9 million real user requests passed through the degraded system before you noticed (at the 1% drop rate from the load-balancer example, about 90,000 of them failed). The synthetic probe costs pennies; the delayed detection costs your SLO.
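The same arithmetic as a reusable sketch, so the numbers can be re-run for your own regions, journeys, and traffic. The function names are illustrative, and the 1% failure fraction is carried over from the load-balancer example as an assumption.

```python
def probes_per_day(regions, journeys, per_minute=1):
    """Synthetic probes issued per day at the given cadence."""
    return regions * journeys * per_minute * 24 * 60

def requests_in_window(rps, minutes):
    """Real user requests that arrive during an undetected window."""
    return rps * minutes * 60

exposed = requests_in_window(10_000, 15)
print(probes_per_day(3, 4))      # 17280
print(exposed)                   # 9000000
print(int(exposed * 0.01))       # 90000 failed requests at a 1% drop rate
```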
Transfer / Where This Shows Up Later
Failure-mode thinking is how you stop being surprised. The three patterns name most of what will ever actually go wrong.
- This module, concept 07 (SLOs): every incident's cost is measured in error budget; cascading/correlated outages burn the budget fastest.
- This module, concept 09 (chaos engineering): chaos exists to validate that your mitigations work against these patterns, not to impress people.
- This module, concept 11 (load shedding): load shedding is the primary defense against cascading; admission control is how you shed fast.
- This module, concept 14 (incident lifecycle): the triage question "which pattern is this?" is often the single most load-bearing decision early in an incident.
- S8 M3 (event-driven): async queues isolate failure domains if they are bounded; unbounded queues turn cascading into slow-death by backlog.
- S9 M3 (Kubernetes): pod anti-affinity, PodDisruptionBudgets, and per-AZ failure domains are how Kubernetes expresses correlated-failure mitigations.
- S10 M4 (operational readiness): the readiness review probes for all three patterns explicitly. "Show me the gray-failure detection" is the question that earns or loses approval.
A leadership-level summary: when you join a new team and ask "what broke us last year," the answer is almost always a cascading or correlated failure that was cheap to prevent and expensive to suffer. The mitigations are cheap relative to the outage; they are expensive relative to the average engineer's patience. Part of your job is to keep the mitigations funded between incidents, because the team's felt need for them decays faster than the calendar cycle that produces the next outage.
One more notation to carry: when a postmortem says "root cause was X," substitute "the proximate trigger was X, and the structural conditions that made X catastrophic were Y and Z." That substitution is what makes the retrospective produce action items instead of blame.
The diagram is redundant with the text deliberately: during an incident you will remember the picture before you remember the paragraph.
A final cultural observation: organizations that take failure modes seriously name them in their postmortems. "This was a correlated failure via the config service" is a much more actionable sentence than "this was a reliability issue." Spread the vocabulary; the response quality follows.
Read This Only If Stuck
Local chunks (book anchors)
- System Design Primer: Availability Patterns -- failover, replication, and the serial/parallel availability formulas. Know where these assume independence.
- System Design Primer: CAP Theorem -- the fundamental trade-off; gray failures hide in the "C vs A" gap.
- System Design Primer: Consistency Patterns -- gray failures often manifest as inconsistency; bounded-staleness contracts are the shape of their mitigation.
- FoSA: Asynchronous Capabilities -- how async decoupling prevents cascading.
- FoSA: Preventing Data Loss -- the chapter on durability guarantees that matter when a failure propagates.
- FoSA: Operations and DevOps -- the operational-design chapter; names the correlations most teams miss.
External canonical references
- Richard Cook, How Complex Systems Fail -- 18 short paragraphs; required reading, especially points 3, 6, 10, and 14.
- Microsoft Research, Gray Failure: The Achilles' Heel of Cloud-Scale Systems (HotOS 2017) -- the paper that named the pattern; read the "differential observability" definition.
- Jeff Dean and Luiz Barroso, The Tail at Scale -- hedged and tied requests as mitigations for gray-ish tail failures.
- AWS Builders' Library, Timeouts, retries, and backoff with jitter -- the canonical writeup on retry storms.
- AWS Builders' Library, Avoiding insurmountable queue backlogs -- the cascading-via-queue pattern, with specific production stories.
- Aphyr (Kyle Kingsbury), Jepsen reports -- the industry reference for correlated-failure testing of distributed databases; read any three to calibrate your skepticism.
- Netflix Tech Blog, Failure injection: Chaos Engineering upgrades -- the practitioners' playbook for exercising these patterns.