Resilience: Timeouts, Retries, Circuit Breakers, Bulkheads
What This Concept Is
Four primitives that stop one slow or failing dependency from cascading into a full outage. Each addresses a different failure mode; a robust synchronous call path needs all four.
- Timeouts. Never wait forever. Every network call has an explicit timeout, tighter than the caller's own timeout budget.
- Retries with backoff and jitter. On transient failures, try again -- but with exponential backoff and random jitter so you do not stampede a recovering service.
- Circuit breaker. When a downstream is consistently failing, stop calling it for a while. Fail fast, give it time to recover, and protect your own threads/connections.
- Bulkheads. Partition your resources (threads, connections, queues) by downstream so a failure in one downstream cannot starve all your capacity for calls to others.
These are the stability patterns from Michael Nygard's Release It!, and they are the shared vocabulary on most mature teams.
Why It Matters Here
Without these, a microservices architecture is less reliable than the monolith it replaced. A slow dependency eight hops away can drain threads, exhaust connection pools, and turn a local network blip into a total outage. With them, the same dependency failing produces localized, observable, recoverable degradation.
Sequence Diagram: All Four Primitives at Work
Suppose Service A calls Service B. B is currently failing slowly.
Notes:
- Timeout catches the slow failure; without it, A would block forever.
- Bulkhead (the pool dedicated to A->B) prevents B's slowness from consuming every connection A has, including those meant for C and D.
- Circuit breaker fast-fails after repeated errors so A stops queueing doomed requests and protects B from a thundering-herd on recovery.
- Retries (not shown for clarity) would sit between timeout and circuit breaker: on failure, retry once with backoff+jitter, then count as failure if it still fails.
Timeout Budgets
Timeouts are only meaningful if they get tighter as you go deeper; otherwise the caller gives up first while the downstream keeps working on a request nobody is waiting for. If the user-facing request budget is 2s and the API gateway spends 100ms, the BFF gets 1.9s, and any downstream it calls should have a timeout below 1.9s, with room left for retries.
A simple budget:
| Layer | Timeout |
|---|---|
| User -> gateway | 2000ms total |
| Gateway -> BFF | 1800ms |
| BFF -> service | 800ms per call (at most 2 parallel) |
| Service -> DB | 200ms |
| Service -> other service | 500ms |
When a downstream's p99 is 600ms and your timeout is 500ms, you need to either raise the timeout or fix the downstream. Never set timeouts higher than the user-visible budget minus a safety margin.
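One way to keep each hop inside the caller's remaining budget is to pass an absolute deadline down the call path and derive every per-hop timeout from it. The sketch below is a minimal illustration, not a library API: the `Deadline` class, its method names, and the 50ms safety margin are all assumptions made for this example.

```python
import time


class Deadline:
    """Absolute deadline shared down one call path (hypothetical helper)."""

    def __init__(self, budget_ms, now=time.monotonic):
        # `now` is injectable so the class can be tested with a fake clock.
        self._now = now
        self._expires_at = now() + budget_ms / 1000.0

    def remaining_ms(self):
        """Milliseconds left before the user-visible budget is spent."""
        return max(0.0, (self._expires_at - self._now()) * 1000.0)

    def timeout_for(self, cap_ms, margin_ms=50.0):
        """Timeout for the next hop: its configured cap, or whatever is
        left of the budget minus a safety margin, whichever is smaller."""
        return max(0.0, min(cap_ms, self.remaining_ms() - margin_ms))
```

With a 2000ms budget, a hop capped at 800ms gets the full 800ms at the start of the request, but only 450ms once 1500ms have elapsed. A result of 0 means the budget is already spent and the call should be skipped rather than attempted.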
Retries: Rules
- Only retry idempotent operations. Retrying a `POST /charge` duplicates charges. Either make the operation idempotent (idempotency key) or do not retry.
- Exponential backoff. First retry after 50ms, then 200ms, then 800ms. Do not retry faster than the downstream can plausibly recover.
- Jitter. Randomize the backoff by 0-50% so your retrying clients do not form a synchronized wave.
- Retry budget. Cap the total extra work retries can generate (e.g., "at most 10% of requests may be retries"). Without this, you DDoS the recovering downstream.
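The rules above can be sketched together as follows. This is an illustrative sketch under stated assumptions, not a production client: `TransientError`, `backoff_delays`, `RetryBudget`, and `call_with_retries` are hypothetical names. Jitter is read here as stretching each delay by 0-50%, and the budget compares retries against first attempts using counters that never reset; a real implementation would typically use a sliding window.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delays(base_ms=50.0, factor=4.0, max_retries=3, jitter=0.5,
                   rng=random.random):
    """Yield the wait before each retry: 50, 200, 800 ms, each stretched by
    a random 0-50% so clients do not retry in lockstep."""
    for attempt in range(max_retries):
        yield base_ms * factor ** attempt * (1.0 + jitter * rng())


class RetryBudget:
    """Permit a retry only while retries stay within `ratio` of requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire(self):
        if self.retries + 1 > self.ratio * self.requests:
            return False  # budget spent: fail instead of adding load
        self.retries += 1
        return True


def call_with_retries(fn, budget, sleep=time.sleep):
    """Call an idempotent `fn`, retrying transient failures within budget."""
    budget.record_request()
    delays = backoff_delays()
    while True:
        try:
            return fn()
        except TransientError:
            delay_ms = next(delays, None)
            if delay_ms is None or not budget.try_acquire():
                raise  # out of attempts or out of budget
            sleep(delay_ms / 1000.0)
```

The budget is shared across all callers of one downstream, so a burst of failures exhausts it quickly and later failures surface immediately instead of piling more load on the recovering service.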
Circuit Breaker: States
- Closed. Calls flow normally. Failures are counted.
- Open. After a failure threshold (e.g., 10 failures in 30s, or 50% error rate over 20 requests), fast-fail without calling downstream for a cooldown window.
- Half-open. After cooldown, allow a small number of probe calls. If they succeed, close. If they fail, open again.
Use a well-tested library: resilience4j for the JVM (the usual replacement for the now-archived Hystrix), Polly for .NET, opossum for Node.js, Sentinel for Go. Do not roll your own.
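To make the three states concrete, here is a teaching sketch of the state machine. It is meant for reading, not for production use, which is what the libraries above are for. It simplifies in several ways that should be treated as assumptions: it counts consecutive failures rather than a windowed error rate, it allows a single probe in half-open, and it is not thread-safe. `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fast-fails without calling downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=10, cooldown_s=30.0,
                 now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._now = now  # injectable clock for tests
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._now() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open: failing fast")
            self.state = "half-open"  # cooldown over: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open probe) closes the circuit.
        self.state = "closed"
        self._failures = 0

    def _on_failure(self):
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._now()
```

Callers catch `CircuitOpenError` to return a fallback immediately, which is what turns "every request pays the full timeout" into "every request fails in microseconds."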
Bulkheads
Two common implementations:
- Thread-pool bulkhead. Each downstream has its own thread pool. When that pool is exhausted, calls to that downstream fail fast instead of queuing.
- Semaphore bulkhead. Each downstream has a concurrency limit (e.g., at most 30 in-flight calls). Beyond that, fail fast.
The cheap version: a separate HTTP client connection pool per downstream, sized appropriately. The expensive version: a full service mesh with per-route concurrency limits.
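A semaphore bulkhead is small enough to sketch in full. The version below is a minimal illustration using only the standard library; the `Bulkhead` and `BulkheadFullError` names, the downstream names, and the limit of 30 are assumptions for the example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the downstream's concurrency limit is already in use."""


class Bulkhead:
    """Cap in-flight calls to one downstream; reject instead of queueing."""

    def __init__(self, max_concurrent=30):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no slot is free, fail fast.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("bulkhead full: failing fast")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream, so a slow B cannot use up C's capacity.
bulkheads = {
    "service-b": Bulkhead(max_concurrent=30),
    "service-c": Bulkhead(max_concurrent=30),
}
```

The key design choice is the non-blocking acquire: a blocking acquire would just move the queue from the downstream into the caller, which is the starvation the bulkhead is supposed to prevent.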
Common Confusion / Misconception
"More retries = more resilience." False. Unbounded retries turn a small outage into a DDoS on the recovering service. Always use backoff, jitter, and a retry budget.
"We have a timeout, so we are fine." A timeout without a circuit breaker means every user request still pays the full timeout cost while the downstream is dead. The circuit breaker converts that to instant failure + fallback.
"Bulkheads are exotic." They are just "do not share one thread pool across all downstreams." Most HTTP clients let you configure per-host pools in a few lines.
How To Use It
For every synchronous inter-service call, specify these in code or config:
- Timeout (absolute, tighter than caller budget).
- Retry policy: `max = N`, `backoff = exponential`, `jitter = yes`, `only idempotent`.
- Circuit breaker: thresholds, cooldown, fallback behavior.
- Bulkhead: per-host pool size or concurrency limit.
- Fallback: cached last-good value, empty response, or a degraded mode.
Review these in production incident post-mortems. They tend to be wrong until you have been paged for them.
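One way to make this checklist reviewable is to hold it as a single record per call path and lint it. The sketch below assumes a hypothetical `CallPolicy` record; the field names and the two checks are illustrative choices, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallPolicy:
    """Resilience settings for one synchronous call path (hypothetical)."""
    name: str
    timeout_ms: int
    max_retries: int
    idempotent: bool
    breaker_error_pct: int
    breaker_cooldown_s: int
    max_concurrent: int
    fallback: str  # e.g. "cached", "empty", "error"

    def problems(self, caller_budget_ms):
        """Return a list of rule violations; empty means the policy passes."""
        found = []
        if self.timeout_ms >= caller_budget_ms:
            found.append("timeout is not tighter than the caller budget")
        if self.max_retries > 0 and not self.idempotent:
            found.append("retries configured on a non-idempotent call")
        return found
```

For example, `CallPolicy("bff->charge", 2000, 2, False, 50, 30, 30, "error").problems(1800)` reports both violations, while a policy with an 800ms timeout and retries only on an idempotent call returns an empty list.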
Check Yourself
- Why does a timeout alone fail to protect against a slow downstream under sustained failure?
- Why is retry-without-backoff worse than no retry?
- What is the difference between a bulkhead and a circuit breaker -- aren't they both "stop calling the bad thing"?
Mini Drill or Application
Take a single call path from a real system. In 15 minutes, specify all four primitives plus the fallback:
- timeout (ms)
- retry policy (max, backoff, jitter, idempotent?)
- circuit breaker (error %, window, cooldown)
- bulkhead (pool size or concurrency limit)
- fallback behavior on open circuit
How This Sits In The Module
This is the operational heart of microservices. If you cannot draw the diagram above from memory, do not call any system you build "production-ready". Concept 13 adds the observability to tell when these primitives are tripping.
SLO-Driven Timeouts
How do you pick a timeout value? Work backward from the SLO:
- User SLO. "99% of checkouts complete in 2000ms."
- Budget allocation. Allocate the 2000ms across the call tree. Gateway overhead 100ms, BFF overhead 100ms, each downstream gets a slice. The deepest downstream should have ~300-500ms to respond.
- Timeout just above the downstream's p99.9. Set the timeout above the downstream's p99.9 latency so it fires only on genuinely degraded calls, not legitimate slow-path requests. If p99.9 exceeds the budget, the downstream is out of SLO and you have a capacity or performance problem, not a timeout problem.
- Retries consume budget. If you retry once with 50ms backoff, you need `2 · timeout + backoff ≤ remaining budget` (two attempts plus one wait between them). Most real-world latency budgets are tight; that is why only the cheapest hops should retry.
The SRE practice (Google SRE Book, chapter on handling overload) formalizes this as part of error budget management. The number you set in config is traceable to a user-visible commitment, not a gut feeling.
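The arithmetic is simple enough to check in code. The sketch below takes a measured p99.9 as input, adds a fixed headroom, and tests whether the worst case still fits the hop's budget, counting one timeout per attempt plus one backoff per retry. The function names and the 30ms headroom are assumptions for illustration.

```python
def worst_case_ms(timeout_ms, backoff_ms, retries):
    """Wall time if every attempt times out: all attempts plus the waits."""
    return (retries + 1) * timeout_ms + retries * backoff_ms


def check_timeout(p999_ms, budget_ms, backoff_ms=50, retries=1,
                  headroom_ms=30):
    """Place the timeout just above p99.9 and report whether it fits.

    Returns (timeout_ms, worst_case, fits_budget).
    """
    timeout_ms = p999_ms + headroom_ms
    worst = worst_case_ms(timeout_ms, backoff_ms, retries)
    return timeout_ms, worst, worst <= budget_ms


# A fast hop can afford one retry inside an 800 ms slice...
fast = check_timeout(p999_ms=300, budget_ms=800)
# ...a slower hop cannot, unless retries are turned off.
slow = check_timeout(p999_ms=450, budget_ms=800)
slow_no_retry = check_timeout(p999_ms=450, budget_ms=800, retries=0)
```

Here the 300ms hop fits with one retry (710ms worst case), the 450ms hop does not (1010ms), and the same hop fits once retries are disabled (480ms).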
Fallbacks and Graceful Degradation
When the circuit is open, what does the caller return? Options, in order of preference:
- Cached last-good value. Product recommendations from yesterday are better than none.
- Empty or default value marked as degraded (the UI shows "recommendations unavailable").
- Bypass (skip the feature). The checkout proceeds without the loyalty points calculation; user sees "points will be credited later."
- User-facing error. Only when no degradation is possible (e.g., payment authorization -- the checkout must fail fast).
Graceful degradation is the payoff for the resilience work. Without fallbacks, all four primitives just make failures faster; with fallbacks, they make failures partial. Nygard's Release It! calls this "failing gracefully"; the SRE vocabulary is "graceful degradation mode."
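The first two options in the preference order can be sketched as a small wrapper. This is a minimal illustration, assuming an in-memory dict as the last-good cache; `with_fallback` and its parameters are hypothetical names, and a real system would bound the cache and expire stale entries.

```python
def with_fallback(primary, cache, key, default=None):
    """Return (value, degraded).

    Try the live call first; on any failure serve the last-good cached
    value, and if there is none, serve the default. `degraded` lets the
    caller tell the UI that the result is not fresh.
    """
    try:
        value = primary(key)
    except Exception:
        if key in cache:
            return cache[key], True
        return default, True
    cache[key] = value  # refresh the last-good value on every success
    return value, False
```

Returning the `degraded` flag alongside the value is the important part: it lets the UI show "recommendations unavailable" instead of presenting stale or empty data as if it were current.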
Read This Only If Stuck
Local chunks
- FoSA: Preventing Data Loss (broader reliability) -- reliability framing that complements the four primitives.
- FoSA: Architecture Characteristics Ratings (resilience/reliability rows) -- resilience appears in the characteristic catalog.
- FoSA: Fitness Functions -- resilience budgets can be fitness functions (fail CI if timeouts exceed SLO).
- Primer: Availability Patterns -- the foundational availability vocabulary, including active-active, active-passive, fail-over.
- Primer: Performance vs Scalability -- why tail latency drives system-level SLOs.
- Primer: CAP Theorem -- partition tolerance is the reason these primitives exist.
- Primer: Latency vs Throughput -- vocabulary for the timeout budget conversation.
External canonical references
- Michael Nygard, Release It! (2nd ed.) -- the canonical book. Chapters on "Stability Patterns" and "Capacity Patterns" cover all four primitives and more; the book every senior engineer should own.
- Chris Richardson, Circuit Breaker pattern -- short pattern page.
- Martin Fowler, CircuitBreaker -- short bliki entry.
- Netflix Tech Blog, Making the Netflix API More Resilient (Hystrix origin) -- the practitioner origin story.
- resilience4j, Getting started -- the JVM reference implementation, with clear docs on each primitive.
- AWS Builders' Library, Timeouts, retries, and backoff with jitter -- one of the most-referenced modern treatments; authored by Marc Brooker.
- Google SRE Book, Handling Overload -- the load-shedding and adaptive throttling perspective.
- Google SRE Book, Addressing Cascading Failures -- the failure modes the four primitives prevent.
- InfoQ, Resilience Engineering: The What and How -- broader framing.
Depth Path
- Netflix Hystrix documentation and the resilience4j docs. Hystrix is archived but the ideas carry through. Opossum (Node.js) and Polly (.NET) implement the same patterns, each with its own API.
- Sidney Dekker, Drift into Failure -- the sociotechnical framing of why these primitives are necessary but not sufficient. Read after a major incident where the primitives did not save you.