Stress Test Clinic

Retrieval Prompts

  1. Describe the 10×-100× walk protocol in five steps.
  2. Name the four columns of the failure-walk table.
  3. Give one example of a bottleneck that is not a single point of failure (SPOF), and one example of a SPOF that is not a bottleneck.
  4. What is the structural difference between a "crash-only" failure and a "slow" failure, and why does slow hurt more?
  5. What is the danger of retry without backoff and jitter?

Compare and Distinguish

Separate cleanly:

  • horizontal scaling vs vertical scaling
  • bottleneck (now) vs bottleneck (projected)
  • dead component vs slow component
  • hot key vs hot shard vs hot partition
  • accepted SPOF vs unidentified SPOF

Produce a one-sentence distinction for each line above. The hot key / hot shard / hot partition line has three terms; tell all three apart.

Common Mistake Check

Identify the flaw:

  1. "Auto-scaling handles 10×." (What is this statement avoiding?)
  2. "Replicas eliminate the SPOF." (Give two counter-examples.)
  3. "Our P99 is 50 ms, so we're fine." (What is missing?)
  4. "We'll add a circuit breaker later." (What class of failures does that decision accept right now?)
  5. "The cache is 99.999% available, so we can rely on it." (What does cache failure do to downstream origin load?)
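
For item 5, a quick calculation helps when checking your answer. Availability says how rarely the cache fails, not what the origin sees when it does. The sketch below assumes a hypothetical 50,000 requests per second and a 95% hit rate; both numbers are invented for illustration, and the function name `origin_rps` is not from any library.

```python
def origin_rps(total_rps: float, hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the origin."""
    return total_rps * (1.0 - hit_rate)


# Hypothetical traffic and hit rate, for illustration only.
normal = origin_rps(50_000, 0.95)   # cache healthy
cold = origin_rps(50_000, 0.0)      # cache down or empty: every request misses

# With these inputs the origin sees roughly 20 times its usual load.
print(f"normal={normal:.0f} rps, cold={cold:.0f} rps, ratio={cold / normal:.1f}x")
```

The higher the hit rate, the larger this ratio gets, so a very effective cache can hide an origin that is sized for only a small fraction of real traffic.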

Mini Application

Take two diagrams from the High-Level Design Workshop (practice 2). For each:

10× walk

Walk the hot-path components in order. For each:

  • name the sizing constraint (CPU, memory, disk, network, connections, partition count, fan-out)
  • state whether it fits at 10× traffic
  • for the first component that does not fit, propose the minimum structural change

Write this out as a numbered list; do not hand-wave.
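
One way to keep the walk from turning into hand-waving is to attach a number to every box. The sketch below is only an illustration: the component names, limits, and loads are invented, and it assumes load grows linearly with traffic, which fan-out, lock contention, and connection limits often violate in practice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    name: str          # box on the hot path (hypothetical)
    constraint: str    # the sizing constraint that binds first
    limit: float       # capacity, in the constraint's unit
    load_now: float    # measured or estimated load at 1x traffic


def first_misfit(path: List[Component], factor: float) -> Optional[Component]:
    """Return the first hot-path component that no longer fits at factor x traffic."""
    for c in path:
        # Assumption: load scales linearly with traffic.
        if c.load_now * factor > c.limit:
            return c
    return None


# Hypothetical numbers, for illustration only.
hot_path = [
    Component("load balancer", "connections", 100_000, 4_000),
    Component("api tier", "CPU cores", 640, 48),
    Component("primary db", "write IOPS", 20_000, 3_500),
]

for factor in (10, 100):
    c = first_misfit(hot_path, factor)
    print(f"{factor}x:", f"{c.name} ({c.constraint})" if c else "everything fits")
```

With these made-up inputs, the database is the first box to fail at 10× and the load balancer is the first at 100×. The code does not decide the structural change for you; it only tells you where the walk has to stop and a proposal is due.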

100× walk

Re-do the walk for 100× the original traffic. Expect at least two structural changes per design. Capture them in order.

Failure walk (per-box)

Produce the four-column table (impact, blast radius, recovery, TTR) for every box in the diagram. Add entries for:

  • a single availability zone (AZ) outage (all components in that AZ dead simultaneously)
  • a full region outage
  • a network partition between the primary and replica regions
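
If you keep the table as data rather than prose, a small completeness check can catch boxes or outage scenarios that were skipped. This is a minimal sketch under assumptions: the `FailureRow` type, the `missing_rows` helper, the scenario labels, and the example boxes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailureRow:
    subject: str        # a box in the diagram, or an outage scenario
    impact: str         # what stops working
    blast_radius: str   # who or what is affected
    recovery: str       # how it comes back
    ttr: str            # the TTR column, e.g. "30 s"


# The three extra rows the exercise asks for (labels are illustrative).
REQUIRED_SCENARIOS = [
    "single AZ outage",
    "full region outage",
    "primary/replica region partition",
]


def missing_rows(boxes: Iterable[str], rows: Iterable[FailureRow]) -> List[str]:
    """List every box or required scenario that has no row in the table yet."""
    covered = {r.subject for r in rows}
    return [s for s in list(boxes) + REQUIRED_SCENARIOS if s not in covered]


# Hypothetical diagram with two boxes and one row filled in so far.
rows = [FailureRow("cache", "misses fall through to db", "all reads", "restart and warm", "5 min")]
print(missing_rows(["cache", "primary db"], rows))
```

An empty result means every box and every required scenario has a row; it says nothing about whether the rows are any good.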

Ranked bottlenecks and SPOFs

Produce two ranked lists, one of bottlenecks and one of SPOFs. Mark each entry as one of: fix-now, fix-in-phase-2, accept-with-reason. For every accept-with-reason entry, write the exact sentence you would put in a design doc.

Dogpile / Cascade Drills

Walk these specific scenarios for one of your designs:

  1. Cold cache: the entire cache cluster restarts during peak. What happens?
  2. Retry storm: a downstream service returns 500s for 10 seconds; clients retry 3× with no jitter. What happens?
  3. Poison pill: a single malformed message in a Kafka topic blocks the consumer. What happens?
  4. Slow failover: the primary DB dies and the promotion takes 3 minutes instead of 30 seconds. What happens?

For each, describe the user-visible effect and one mitigation.
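
For the retry-storm drill, one commonly cited mitigation is exponential backoff with "full jitter": each retry waits a random time between zero and a capped, exponentially growing ceiling, so clients stop retrying in lockstep. The sketch below is a minimal illustration, not a prescribed answer. The function names and the `base` and `cap` defaults are assumptions, and it reads "retry 3×" as three retries after the initial attempt.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                      rng: Optional[random.Random] = None) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))   # exponential backoff, capped
    return source.uniform(0.0, ceiling)         # full jitter: anywhere in [0, ceiling]


def worst_case_attempts(max_retries: int) -> int:
    """Attempts one logical request generates while the dependency keeps failing."""
    return 1 + max_retries


# Without backoff, 3 retries means up to 4 attempts per request hitting the failing service.
print("attempts per request:", worst_case_attempts(3))

rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(3):
    print(f"retry {attempt}: sleep {full_jitter_delay(attempt, rng=rng):.3f} s")
```

Jitter spreads the retries out in time but does not reduce their number; a retry budget or circuit breaker is what caps the total extra load.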

Evidence Check

This page is complete only if:

  • you produced written 10× and 100× walks for two designs, with ranked structural changes
  • you produced a per-box failure walk table with AZ- and region-level rows
  • you produced a ranked bottleneck list and a ranked SPOF list for each design
  • every "accept-with-reason" entry has a concrete sentence you could defend in review
  • you walked at least two dogpile/cascade scenarios and named one mitigation for each