Stress Test Clinic
Retrieval Prompts
- Describe the 10×-100× walk protocol in five steps.
- Name the four columns of the failure-walk table.
- Give one example of a bottleneck that is not a SPOF, and one example of a SPOF that is not a bottleneck.
- What is the structural difference between a "crash-only" failure and a "slow" failure, and why does slow hurt more?
- What is the danger of retry without backoff and jitter?
Compare and Distinguish
Separate cleanly:
- horizontal scaling vs vertical scaling
- bottleneck (now) vs bottleneck (projected)
- dead component vs slow component
- hot key vs hot shard vs hot partition
- accepted SPOF vs unidentified SPOF
Produce a one-sentence distinction for each item. For the three-way item (hot key vs hot shard vs hot partition), cover all three terms.
Common Mistake Check
Identify the flaw:
- "Auto-scaling handles 10×." (What is this statement avoiding?)
- "Replicas eliminate the SPOF." (Give two counter-examples.)
- "Our P99 is 50 ms, so we're fine." (What is missing?)
- "We'll add a circuit breaker later." (What class of failures does that decision accept right now?)
- "The cache is 99.999% available, so we can rely on it." (What does cache failure do to downstream origin load?)
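For the cache item, the arithmetic can be sketched as follows. This is a minimal illustration, assuming every request that would have hit the cache falls through to the origin when the cache is down; the function name `origin_load_multiplier` and the sample hit rates are hypothetical.

```python
def origin_load_multiplier(hit_rate: float) -> float:
    """Factor by which origin traffic grows if every cache hit becomes a miss.

    With total request rate R and cache hit rate h, the origin normally
    serves (1 - h) * R. With the cache gone it serves R, so the ratio
    is 1 / (1 - h).
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show how fast the factor grows.
    for hit_rate in (0.80, 0.95, 0.99):
        factor = origin_load_multiplier(hit_rate)
        print(f"hit rate {hit_rate:.0%}: origin sees about {factor:.0f}x its normal load")
```

The point of the sketch is that cache availability and origin capacity are separate questions: the higher the hit rate, the less headroom the origin is likely to have when the cache disappears.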
Mini Application
Take two diagrams from the High-Level Design Workshop (practice 2). For each:
10× walk
Walk the hot-path components in order. For each:
- name the sizing constraint (CPU, memory, disk, network, connections, partition count, fan-out)
- state whether it fits at 10× traffic
- for the first component that does not fit, propose the minimum structural change
Write this out as a numbered list; do not hand-wave.
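One way to keep the walk honest is to write the numbers down before arguing about them. The sketch below is an illustration only: the `Component` type, the component names, and every load and capacity figure are hypothetical, and it assumes load on each component scales linearly with traffic, which is often not true for fan-out or connection counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    constraint: str   # the sizing constraint, e.g. "connections", "CPU cores"
    load_now: float   # current peak load, in the constraint's unit
    capacity: float   # hard limit, in the same unit


def first_misfit(hot_path: list[Component], multiplier: float) -> Optional[Component]:
    """Return the first hot-path component whose scaled load exceeds capacity."""
    for component in hot_path:
        # Assumption: load grows linearly with the traffic multiplier.
        if component.load_now * multiplier > component.capacity:
            return component
    return None


# Hypothetical hot path, listed in request order.
HOT_PATH = [
    Component("load balancer", "connections", 2_000, 100_000),
    Component("app tier", "CPU cores", 8, 64),
    Component("primary DB", "write IOPS", 1_500, 10_000),
]

if __name__ == "__main__":
    for step, component in enumerate(HOT_PATH, start=1):
        scaled = component.load_now * 10
        verdict = "fits" if scaled <= component.capacity else "DOES NOT FIT"
        print(f"{step}. {component.name} ({component.constraint}): "
              f"{scaled:g} of {component.capacity:g} at 10x, {verdict}")
    misfit = first_misfit(HOT_PATH, 10)
    if misfit is not None:
        print(f"First structural change is needed at: {misfit.name}")
```

The code only finds where the walk breaks; proposing the minimum structural change for that component is still the written part of the exercise.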
100× walk
Re-do the walk for 100× the original traffic. Expect at least two structural changes per design. Capture them in order.
Failure walk (per-box)
Produce the four-column table (impact, blast radius, recovery, TTR) for every box in the diagram. Add entries for:
- a single AZ outage (all components in that AZ dead simultaneously)
- a full region outage
- a network partition between the primary and replica regions
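If it helps to keep the table uniform across designs, the rows can be held as records and rendered as plain text. This is a sketch under assumptions: `FailureRow`, `render`, and both example rows are hypothetical, and TTR is read here as time to recover.

```python
from dataclasses import dataclass


@dataclass
class FailureRow:
    box: str           # the component or failure domain being walked
    impact: str
    blast_radius: str
    recovery: str
    ttr: str           # read here as "time to recover" (an assumption)


def render(rows: list[FailureRow]) -> str:
    """Render the failure walk as an aligned plain-text table."""
    header = ("box", "impact", "blast radius", "recovery", "TTR")
    table = [header] + [
        (r.box, r.impact, r.blast_radius, r.recovery, r.ttr) for r in rows
    ]
    # Pad each column to the width of its longest cell.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in table
    )


# Hypothetical rows; replace with the boxes from your own diagram.
ROWS = [
    FailureRow("cache cluster", "origin load spikes", "all read traffic",
               "restart and warm", "minutes"),
    FailureRow("single AZ outage", "capacity drops", "every box in that AZ",
               "shift traffic to other AZs", "minutes"),
]

if __name__ == "__main__":
    print(render(ROWS))
```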
Ranked bottlenecks and SPOFs
Produce the two ranked lists. For each entry, mark one of: fix-now, fix-in-phase-2, accept-with-reason. For every accept-with-reason, write the exact sentence you would put in a design doc.
Dogpile / Cascade Drills
Walk these specific scenarios for one of your designs:
- Cold cache: the entire cache cluster restarts during peak. What happens?
- Retry storm: a downstream service returns 500s for 10 seconds; clients retry 3× with no jitter. What happens?
- Poison pill: a single malformed message in a Kafka topic blocks the consumer. What happens?
- Slow failover: the primary DB dies and the promotion takes 3 minutes instead of 30 seconds. What happens?
For each, describe the user-visible effect and one mitigation.
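For the retry-storm drill, one commonly used mitigation is capped exponential backoff with full jitter. The sketch below is illustrative rather than prescriptive: the function names and the `base` and `cap` values are assumptions, and it only computes delays instead of performing real calls.

```python
import random
from typing import Optional


def attempts_per_request(max_retries: int) -> int:
    """Worst-case attempts one failing request generates (original + retries)."""
    return 1 + max_retries


def backoff_delays(max_retries: int,
                   base: float = 0.1,
                   cap: float = 5.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Delays in seconds before each retry, using capped backoff with full jitter.

    Retry number `attempt` (0-based) waits a uniformly random time in
    [0, min(cap, base * 2**attempt)], so clients that failed at the same
    instant spread their retries out instead of returning in lockstep.
    """
    rng = rng or random.Random()
    return [
        rng.uniform(0.0, min(cap, base * 2 ** attempt))
        for attempt in range(max_retries)
    ]


if __name__ == "__main__":
    # With 3 retries and no backoff, a failing dependency can see up to
    # this many times its normal request rate while it is returning errors.
    print("worst-case amplification:", attempts_per_request(3))
    print("jittered delays (s):", backoff_delays(3))
```

Backoff and jitter spread the retries over time but do not cap their total number; a retry budget or circuit breaker is a separate mitigation.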
Evidence Check
This page is complete only if:
- you produced written 10× and 100× walks for two designs, with ranked structural changes
- you produced a per-box failure walk table with AZ- and region-level rows
- you produced a ranked bottleneck list and a ranked SPOF list for each design
- every "accept-with-reason" entry has a concrete sentence you could defend in review
- you walked at least two dogpile/cascade scenarios and named one mitigation for each