Resilience Clinic
Retrieval Prompts
- Name the four resilience primitives from concept 12.
- State the three circuit-breaker states and the transitions between them.
- State the rule that makes retries safe to add.
- State what a bulkhead protects that a circuit breaker does not.
- State why a timeout budget must shrink with depth.
Compare and Distinguish
Separate these pairs clearly:
- timeout vs circuit breaker
- circuit breaker vs bulkhead
- retry with jitter vs retry with fixed backoff
- synchronous fan-out vs asynchronous fan-out
- server-side discovery vs client-side discovery
Common Mistake Check
For each statement, identify the failure mode:
- "We added retries to every HTTP call. No backoff needed -- the downstream will recover."
- "We have timeouts set to 30 seconds on every internal call."
- "Our gateway times out at 60s, so downstream services use 50s timeouts."
- "We retry on every 5xx response, including POST /charge."
- "We removed the circuit breaker -- it was 'too aggressive' when the downstream was only a little slow."
Mini Application: Resilience Spec for One Call Path
Take a sync call path from your decomposition. Example: BFF -> Orders -> Payments -> external PSP (payment service provider).
Produce, in 45 minutes:
1. Timeout Budget Table
| Layer | Budget | Reason |
|---|---|---|
| User-perceived | 2000ms | acceptable p99 for checkout |
| Gateway -> BFF | 1800ms | subtract gateway overhead |
| BFF -> Orders | ? | |
| Orders -> Payments | ? | |
| Payments -> PSP | ? | |
Fill all rows. Each budget must be strictly tighter than its caller's, with room for one retry.
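One way to sanity-check a filled-in table is a small script like the sketch below. It assumes one reading of the rule above: the worst case for a hop (every attempt running to its timeout, plus the backoff waits) must be strictly smaller than the per-attempt timeout of the hop that calls it. The `Hop` type, the `budget_violations` helper, and every number other than the 1800ms gateway row are hypothetical placeholders, not suggested answers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    timeout_ms: int      # per-attempt timeout for this call
    retries: int = 0     # extra attempts the caller may make
    backoff_ms: int = 0  # wait before each retry (jitter ignored here)

    def worst_case_ms(self) -> int:
        # Every attempt runs to its timeout, plus one backoff wait per retry.
        return (self.retries + 1) * self.timeout_ms + self.retries * self.backoff_ms


def budget_violations(hops: list[Hop]) -> list[str]:
    """Return one message per hop whose worst case does not fit inside its caller."""
    problems = []
    for outer, inner in zip(hops, hops[1:]):
        if inner.worst_case_ms() >= outer.timeout_ms:
            problems.append(
                f"{inner.name}: worst case {inner.worst_case_ms()}ms "
                f"does not fit inside {outer.name} ({outer.timeout_ms}ms)"
            )
    return problems


# Placeholder numbers only: substitute the values from your own table.
path = [
    Hop("Gateway -> BFF", 1800),
    Hop("BFF -> Orders", 1500),
    Hop("Orders -> Payments", 1200),
    Hop("Payments -> PSP", 500, retries=1, backoff_ms=50),
]

# Same path, but the PSP timeout is too loose to leave room for its retry.
too_loose = path[:3] + [Hop("Payments -> PSP", 700, retries=1, backoff_ms=50)]

for problem in budget_violations(too_loose):
    print(problem)
```

If your team reads "room for one retry" differently (for example, retries only at the outermost hop), adjust `worst_case_ms` to match before trusting the result.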
2. Retry Policy Per Hop
For each call, specify:
- max retries
- backoff schedule (e.g., 50ms, 200ms, 800ms)
- jitter percentage
- idempotent? if not, how is it made idempotent (idempotency key, dedup window)
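The four fields above can be read as parameters of a single retry wrapper. The following is a minimal sketch, not a production client: `TransientError`, `jittered_delays_ms`, and `call_with_retries` are hypothetical names, the default schedule is the example schedule from the list, and the 20% jitter is an arbitrary placeholder. It refuses to retry a call unless the caller declares it idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def jittered_delays_ms(schedule_ms, jitter_pct):
    """Shift each scheduled delay by up to +/- jitter_pct percent."""
    delays = []
    for base in schedule_ms:
        spread = base * jitter_pct / 100
        delays.append(base + random.uniform(-spread, spread))
    return delays


def call_with_retries(call, *, idempotent, schedule_ms=(50, 200, 800),
                      jitter_pct=20, sleep=time.sleep):
    """Run call(); retry on TransientError only if the call is idempotent."""
    # A non-idempotent call gets an empty schedule, so it is attempted once.
    delays = jittered_delays_ms(schedule_ms, jitter_pct) if idempotent else []
    attempt = 0
    while True:
        try:
            return call()
        except TransientError:
            if attempt >= len(delays):
                raise  # retries exhausted (or never allowed)
            sleep(delays[attempt] / 1000.0)
            attempt += 1
```

For a call such as POST /charge, the wrapper would stay at `idempotent=False` until the request carries an idempotency key that the downstream deduplicates on.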
3. Circuit Breaker Configuration Per Hop
- failure threshold (count or %)
- rolling window (seconds)
- open-state cooldown
- half-open probe count
- fallback behavior when open
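To see how these five settings interact, it can help to read them as constructor arguments of a small state machine. The sketch below is a simplified, single-threaded illustration under stated assumptions: a count-based threshold, any exception counted as a failure, and a hypothetical `CircuitBreaker` class whose default values are placeholders. A real service would normally use an established resilience library rather than this code.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, cooldown_s=30.0,
                 probe_count=1, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures that trip the breaker
        self.window_s = window_s                    # rolling window for counting them
        self.cooldown_s = cooldown_s                # time spent open before probing
        self.probe_count = probe_count              # successes needed to close again
        self.clock = clock
        self.state = "closed"
        self._failures = deque()                    # timestamps of recent failures
        self._opened_at = 0.0
        self._probe_successes = 0

    def call(self, fn, fallback):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.cooldown_s:
                return fallback()                   # fast-fail: downstream is not called
            self.state = "half_open"                # cooldown elapsed: allow probes
            self._probe_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":
            self._trip(now)                         # a failed probe reopens immediately
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()                # drop failures outside the window
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _on_success(self):
        if self.state == "half_open":
            self._probe_successes += 1
            if self._probe_successes >= self.probe_count:
                self.state = "closed"
                self._failures.clear()

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```

A caller would wrap the downstream call as `breaker.call(charge_fn, fallback_fn)`, where both functions are placeholders for your own code. Passing `clock` as a parameter makes the cooldown testable without sleeping.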
4. Bulkhead Configuration
- pool size or concurrency limit per downstream
- what happens when the limit is hit (fast-fail vs queue)
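The fast-fail option can be sketched with a non-blocking semaphore, one per downstream. This is an illustrative sketch only: `Bulkhead` and `BulkheadFull` are hypothetical names, and the limit you pass in is whatever concurrency limit you chose above.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the concurrency limit for a downstream is already in use."""


class Bulkhead:
    def __init__(self, limit):
        # One slot per call that may be in flight to this downstream.
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, fn):
        # Fast-fail: do not wait for a slot, reject immediately.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("concurrency limit reached")
        try:
            return fn()
        finally:
            self._slots.release()  # always free the slot, even on error
```

The queueing variant would pass a timeout to `acquire` instead of `blocking=False`. That wait has to come out of the caller's timeout budget, which is one reason fast-fail is often the simpler choice.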
5. Fallback Matrix
For each hop, describe what the caller does if the downstream is circuit-open:
- cached last-good value
- degraded response ("payment pending, try again")
- error propagation with correlation ID
- emit compensating event
6. Sequence Diagram
Draw a Mermaid sequence diagram for the path under partial failure: Payments is slow, the circuit opens, the user sees a degraded response, then Payments recovers and the circuit closes. Include the timeout, the fast-fail, and the half-open probe.
7. Observability Pairing
For each primitive, specify the observability signal that tells you it fired:
- timeout -> metric name + dashboard panel
- retry -> metric + log field
- circuit-open -> alert rule
- bulkhead rejection -> metric
Evidence Check
This page is complete only if you can:
- draw the call path with all four primitives from memory
- defend every timeout number against the one above it
- name a fallback for every hop whose circuit can open
- describe what changes on the trace waterfall when each primitive fires
If any of those are shaky, run the mini application on a second call path.