Resilience Clinic

Retrieval Prompts

  1. Name the four resilience primitives from concept 12.
  2. State the three circuit-breaker states and the transitions between them.
  3. State the rule that makes retries safe to add.
  4. State what a bulkhead protects that a circuit breaker does not.
  5. State why a timeout budget must shrink with depth.

Compare and Distinguish

Separate these pairs clearly:

  • timeout vs circuit breaker
  • circuit breaker vs bulkhead
  • retry with jitter vs retry with fixed backoff
  • synchronous fan-out vs asynchronous fan-out
  • server-side discovery vs client-side discovery

Common Mistake Check

For each statement, identify the failure mode:

  1. "We added retries to every HTTP call. No backoff needed -- the downstream will recover."
  2. "We have timeouts set to 30 seconds on every internal call."
  3. "Our gateway times out at 60s, so downstream services use 50s timeouts."
  4. "We retry on every 5xx response, including POST /charge."
  5. "We removed the circuit breaker -- it was 'too aggressive' when the downstream was only a little slow."

Mini Application: Resilience Spec for One Call Path

Take a sync call path from your decomposition. Example: BFF -> Orders -> Payments -> external PSP (payment service provider).

Produce, in 45 minutes:

1. Timeout Budget Table

Layer                 Budget    Reason
User-perceived        2000ms    acceptable p99 for checkout
Gateway -> BFF        1800ms    subtract gateway overhead
BFF -> Orders         ?
Orders -> Payments    ?
Payments -> PSP       ?

Fill all rows. Each budget must be strictly tighter than its caller's, and the caller's budget must leave room for one retry of the call it makes (two attempts plus the backoff between them).
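One way to check a filled-in table is to encode the rule and run it over the rows. The sketch below is a minimal illustration, not a reference answer: it assumes "room for one retry" means the caller's budget covers two attempts of the callee plus one backoff, and the names `Hop` and `check_budgets`, the 50ms backoff, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    budget_ms: int  # total time the caller will wait on this hop


def check_budgets(hops: list[Hop], backoff_ms: int = 50) -> list[str]:
    """Return a list of violations; an empty list means the table is consistent.

    Assumed rule: a hop's budget must cover two attempts of the hop below it
    (the first call plus one retry) and the backoff between them.
    """
    problems = []
    for caller, callee in zip(hops, hops[1:]):
        needed = 2 * callee.budget_ms + backoff_ms
        if callee.budget_ms >= caller.budget_ms:
            problems.append(f"{callee.name} is not tighter than {caller.name}")
        elif needed > caller.budget_ms:
            problems.append(
                f"{caller.name} ({caller.budget_ms}ms) cannot fit one retry of "
                f"{callee.name} (needs {needed}ms)"
            )
    return problems


# Illustrative numbers only; they are not the worksheet's answer.
example = [
    Hop("Gateway -> BFF", 1800),
    Hop("BFF -> Orders", 850),
    Hop("Orders -> Payments", 380),
    Hop("Payments -> PSP", 150),
]
violations = check_budgets(example)
```

Under this rule the budgets roughly halve at each hop, which is why a 30-second timeout deep in the path cannot be defended against a 2-second user-facing target.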

2. Retry Policy Per Hop

For each call, specify:

  • max retries
  • backoff schedule (e.g., 50ms, 200ms, 800ms)
  • jitter percentage
  • idempotency: is the call idempotent? If not, how is it made idempotent (idempotency key, dedup window)?
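The four items above can be sketched in a few lines. This is an illustration under assumptions, not a prescribed policy: `RetryableError`, `backoff_delays`, and `call_with_retry` are hypothetical names, the schedule and 20% jitter are example values, and `send` stands for whatever function performs the call and forwards the idempotency key to the downstream.

```python
import random
import time
import uuid


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry (e.g. a timeout)."""


def backoff_delays(schedule_ms, jitter_pct, rng=random):
    """Yield each scheduled delay in seconds, spread by +/- jitter_pct percent."""
    for base in schedule_ms:
        spread = base * jitter_pct / 100
        yield (base + rng.uniform(-spread, spread)) / 1000


def call_with_retry(send, schedule_ms=(50, 200, 800), jitter_pct=20,
                    sleep=time.sleep):
    """Call send(key), retrying once per entry in schedule_ms."""
    # One key for all attempts, so the downstream can deduplicate a repeat.
    key = str(uuid.uuid4())
    delays = backoff_delays(schedule_ms, jitter_pct)
    while True:
        try:
            return send(key)
        except RetryableError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted; propagate the last failure
            sleep(delay)
```

The key is generated once, outside the loop. Generating it per attempt would defeat deduplication, which is the case that matters for a non-idempotent call such as POST /charge.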

3. Circuit Breaker Configuration Per Hop

  • failure threshold (count or %)
  • rolling window (seconds)
  • open-state cooldown
  • half-open probe count
  • fallback behavior when open
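To see how the five settings interact, it can help to read them as fields of a small state machine. The sketch below is a deliberately minimal, single-threaded illustration: the class name, defaults, and count-based threshold are assumptions, it treats any exception as a failure, and it does not cap concurrent probes the way a production library would.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, window_s=10.0, cooldown_s=30.0,
                 probe_count=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.probe_count = probe_count
        self.clock = clock
        self.state = "closed"
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = 0.0
        self.probe_successes = 0

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.failures.clear()

    def call(self, fn, fallback):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.cooldown_s:
                return fallback()  # fast-fail without touching the downstream
            self.state = "half_open"  # cooldown over: let probes through
            self.probe_successes = 0
        try:
            result = fn()
        except Exception:
            if self.state == "half_open":
                self._open()  # a failed probe re-opens the circuit
            else:
                self.failures.append(now)
                # Drop failures that have aged out of the rolling window.
                while self.failures and now - self.failures[0] > self.window_s:
                    self.failures.popleft()
                if len(self.failures) >= self.failure_threshold:
                    self._open()
            return fallback()
        if self.state == "half_open":
            self.probe_successes += 1
            if self.probe_successes >= self.probe_count:
                self.state = "closed"
        return result
```

The `clock` parameter exists only so the transitions can be exercised without waiting out a real cooldown.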

4. Bulkhead Configuration

  • pool size or concurrency limit per downstream
  • what happens when the limit is hit (fast-fail vs queue)
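The fast-fail option can be sketched with a semaphore per downstream. This is a minimal illustration, assuming a thread-based caller; `Bulkhead` and `BulkheadFull` are hypothetical names, and the limit is whatever the spec settles on.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the downstream's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, fn):
        # blocking=False gives fast-fail; a queueing variant would instead
        # block here with a timeout taken from the hop's budget.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull()
        try:
            return fn()
        finally:
            self._slots.release()
```

One `Bulkhead` instance per downstream keeps a slow Payments call from consuming the slots that calls to other services need.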

5. Fallback Matrix

For each hop, describe what the caller does if the downstream is circuit-open:

  • cached last-good value
  • degraded response ("payment pending, try again")
  • error propagation with correlation ID
  • emit compensating event

6. Sequence Diagram

Draw a Mermaid sequence diagram for the path under partial failure: Payments is slow, the circuit opens, the user sees a degraded response, then Payments recovers and the circuit closes. Include the timeout, the fast-fail, and the half-open probe.

7. Observability Pairing

For each primitive, specify the observability signal that tells you it fired:

  • timeout -> metric name + dashboard panel
  • retry -> metric + log field
  • circuit-open -> alert rule
  • bulkhead rejection -> metric
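A simple way to keep the pairing honest is to route every firing through one recording function, so that no primitive can fire silently. The sketch below uses only the standard library; the `Counter` stands in for a real metrics client, and the metric naming scheme, the `record` helper, and the hop label are assumptions for illustration.

```python
import logging
from collections import Counter

log = logging.getLogger("resilience")
metrics = Counter()  # stand-in for a real metrics client


def record(event: str, hop: str, **fields) -> None:
    """Count one firing of a primitive and log it with its hop and extra fields."""
    metrics[(f"resilience_{event}_total", hop)] += 1
    log.warning("%s fired", event, extra={"hop": hop, **fields})


# Example firings on one hop (illustrative values).
record("timeout", "orders->payments")
record("retry", "orders->payments", attempt=2)
record("circuit_open", "orders->payments")
record("bulkhead_rejection", "orders->payments")
```

Labeling every metric by hop is what lets a dashboard panel or alert rule point at a single edge of the call path rather than at the service as a whole.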

Evidence Check

This page is complete only if you can:

  • draw the call path with all four primitives from memory
  • defend every timeout number against the one above it
  • name a fallback for every hop whose circuit can open
  • describe what changes on the trace waterfall when each primitive fires

If any of those are shaky, run the mini application on a second call path.