Resilience Clinic

Retrieval Prompts

Name the four resilience primitives from concept 12.
State the three circuit-breaker states and the transitions between them.
State the rule that makes retries safe to add.
State what a bulkhead protects that a circuit breaker does not.
State why a timeout budget must shrink with depth.

Compare and Distinguish

Separate these pairs clearly:

timeout vs circuit breaker
circuit breaker vs bulkhead
retry with jitter vs retry with fixed backoff
synchronous fan-out vs asynchronous fan-out
server-side discovery vs client-side discovery

Common Mistake Check

For each statement, identify the failure mode:

"We added retries to every HTTP call. No backoff needed -- the downstream will recover."
"We have timeouts set to 30 seconds on every internal call."
"Our gateway times out at 60s, so downstream services use 50s timeouts."
"We retry on every 5xx response, including POST /charge."
"We removed the circuit breaker -- it was 'too aggressive' when the downstream was only a little slow."

Mini Application: Resilience Spec for One Call Path

Take a synchronous call path from your decomposition. Example: BFF (backend for frontend) -> Orders -> Payments -> external PSP (payment service provider).

Produce, in 45 minutes:

1. Timeout Budget Table

Layer	Budget	Reason
User-perceived	2000ms	acceptable p99 for checkout
Gateway -> BFF	1800ms	subtract gateway overhead
BFF -> Orders	?
Orders -> Payments	?
Payments -> PSP	?

Fill in every row. Each budget must be strictly tighter than its caller's, leaving the caller enough room to retry the call once within its own budget.
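One way to sanity-check the numbers is to derive each per-attempt timeout from the caller's budget. The sketch below is only an illustration: the function names, the overhead and backoff figures, and the 1000ms starting budget are hypothetical and are not the answers for the table above.

```python
def per_attempt_timeout_ms(caller_budget_ms, local_overhead_ms, retries=1, backoff_ms=0):
    """Largest per-attempt timeout that still fits (1 + retries) attempts,
    plus the backoff pauses between them, inside the caller's budget."""
    attempts = 1 + retries
    available = caller_budget_ms - local_overhead_ms - retries * backoff_ms
    if available <= 0:
        raise ValueError("no budget left for the downstream call")
    return available // attempts


def budget_chain(top_budget_ms, hops):
    """Walk a call path top-down. Each hop is (name, overhead_ms, backoff_ms)
    and receives a budget derived from the hop above it."""
    budgets = []
    budget = top_budget_ms
    for name, overhead_ms, backoff_ms in hops:
        budget = per_attempt_timeout_ms(budget, overhead_ms, retries=1, backoff_ms=backoff_ms)
        budgets.append((name, budget))
    return budgets


# Hypothetical two-hop path starting from a 1000ms budget.
chain = budget_chain(1000, [("A -> B", 100, 100), ("B -> C", 50, 50)])
```

Under this simple model, leaving room for one retry at every hop roughly halves the budget per level, which is one reason deep synchronous chains run out of time quickly.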

2. Retry Policy Per Hop

For each call, specify:

max retries
backoff schedule (e.g., 50ms, 200ms, 800ms)
jitter percentage
idempotent? if not, how is it made idempotent (idempotency key, dedup window)
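The four fields above can be sketched as a small policy. This is a minimal illustration, not a library API: `TransientError`, `backoff_schedule_ms`, and `call_with_retries` are hypothetical names, and the defaults simply reproduce the 50ms, 200ms, 800ms example schedule with an assumed 20% jitter.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def backoff_schedule_ms(base_ms=50, factor=4, max_retries=3, jitter_pct=20, rng=random):
    """Exponential backoff; each delay is spread by +/- jitter_pct percent."""
    delays = []
    for attempt in range(max_retries):
        nominal = base_ms * factor ** attempt
        spread = nominal * jitter_pct / 100
        delays.append(nominal + rng.uniform(-spread, spread))
    return delays


def call_with_retries(operation, idempotent, delays_ms, sleep=time.sleep):
    """Retry only when the operation is idempotent; otherwise fail on first error."""
    max_retries = len(delays_ms) if idempotent else 0
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, or retrying is not safe
            sleep(delays_ms[attempt] / 1000)
```

A non-idempotent call such as a charge would be passed with `idempotent=False` until it carries an idempotency key that the downstream deduplicates on.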

3. Circuit Breaker Configuration Per Hop

failure threshold (count or %)
rolling window (seconds)
open-state cooldown
half-open probe count
fallback behavior when open
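To see how these five settings interact, here is a minimal single-threaded sketch of a count-based breaker. The class name, parameter names, and defaults are assumptions for illustration; production libraries differ in detail (percentage thresholds, thread safety, limiting concurrent probes).

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, cooldown_s=5.0,
                 half_open_probes=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures in window that trip it
        self.window_s = window_s                    # rolling window
        self.cooldown_s = cooldown_s                # open-state cooldown
        self.half_open_probes = half_open_probes    # successes needed to close
        self.clock = clock
        self.state = "closed"
        self.failures = deque()                     # timestamps of recent failures
        self.opened_at = 0.0
        self.probe_successes = 0

    def call(self, operation, fallback):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.cooldown_s:
                return fallback()                   # fast-fail, downstream untouched
            self.state = "half_open"                # cooldown over: allow probes
            self.probe_successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure(now)
            return fallback()
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "half_open":
            self._trip(now)                         # a failed probe reopens the circuit
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()                 # drop failures outside the window
        if len(self.failures) >= self.failure_threshold:
            self._trip(now)

    def _on_success(self):
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_probes:
                self.state = "closed"
                self.failures.clear()

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```

The injectable `clock` is a design choice for testability: the closed -> open -> half-open -> closed walk can be driven with a fake clock instead of real sleeps.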

4. Bulkhead Configuration

pool size or concurrency limit per downstream
what happens when the limit is hit (fast-fail vs queue)
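The fast-fail option can be sketched with a semaphore per downstream. `Bulkhead` and `BulkheadFullError` are hypothetical names used only for this illustration.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the concurrency limit for a downstream is already reached."""


class Bulkhead:
    def __init__(self, max_concurrent):
        # One bounded pool of slots per downstream dependency.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Fast-fail: reject immediately instead of waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("concurrency limit reached")
        try:
            return operation()
        finally:
            self._slots.release()
```

A queueing variant would call `acquire(timeout=...)` instead of `acquire(blocking=False)`; in that case the wait has to be counted against the hop's timeout budget.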

5. Fallback Matrix

For each hop, describe what the caller does when the circuit to the downstream is open. Typical options:

cached last-good value
degraded response ("payment pending, try again")
error propagation with correlation ID
emit compensating event

6. Sequence Diagram

Draw a mermaid sequence diagram for the path under partial failure: payments is slow, circuit opens, user sees degraded response, then payments recovers and the circuit closes. Include the timeout, the fast-fail, and the half-open probe.

7. Observability Pairing

For each primitive, specify the observability signal that tells you it fired:

timeout -> metric name + dashboard panel
retry -> metric + log field
circuit-open -> alert rule
bulkhead rejection -> metric
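As a starting point for the pairing, each primitive can increment a counter labeled by hop at the moment it fires. The sketch below keeps counters in memory purely for illustration; the `resilience_*_total` metric names and the alert threshold are assumptions, not a convention from any particular metrics system.

```python
from collections import Counter


class ResilienceMetrics:
    def __init__(self):
        self.counters = Counter()

    def record(self, primitive, hop):
        """primitive is one of: timeout, retry, circuit_open, bulkhead_rejection."""
        self.counters[(f"resilience_{primitive}_total", hop)] += 1

    def circuit_open_alert(self, hop, threshold=1):
        """Hypothetical alert rule: fire once the circuit for this hop has opened."""
        return self.counters[("resilience_circuit_open_total", hop)] >= threshold


metrics = ResilienceMetrics()
metrics.record("timeout", "Orders -> Payments")
metrics.record("circuit_open", "Orders -> Payments")
```

Labeling by hop is what lets a dashboard panel or alert point at the specific call that degraded instead of the path as a whole.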

Evidence Check

This page is complete only if you can:

draw the call path with all four primitives from memory
defend every timeout number against the one above it
name a fallback for every possible open-circuit state
describe what changes on the trace waterfall when each primitive fires

If any of those are shaky, run the mini application on a second call path.
