Mitigations: Retry, Circuit Breaker, Degraded Mode
What This Concept Is
Three canonical reliability patterns, each with a different job. For every external dependency in your capstone, you are deciding -- explicitly or by accident -- which of them to apply.
- Retry says: "the call might fail transiently; try again with backoff and a cap."
- Circuit breaker says: "the dependency is in a bad state; stop calling it for a while so we do not amplify the failure."
- Degraded mode (graceful degradation) says: "when the dependency is unavailable, serve a reduced-but-useful response instead of failing."
They are not alternatives; they are layers. A well-built client retries transient errors, opens a breaker when retries stop helping, and when the breaker is open, falls through to a degraded response. The decision you actually make per dependency is which layers you implement and how they are tuned.
Two other patterns sit alongside these and should be named even when you do not implement them: bulkheads (isolate thread/connection pools so one slow dependency cannot starve the others) and timeouts (the precondition for every other pattern -- without a timeout, retries and breakers are wishes, not mechanisms). If you write "no timeout" in the table below, you have not specified a dependency relationship; you have specified a bug.
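A bulkhead can be sketched with nothing more than a bounded semaphore per dependency. The block below is a minimal illustration, not the capstone's implementation: the `Bulkhead` class, `BulkheadFullError`, and the slot counts are all hypothetical names and values chosen for the example.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when every concurrency slot for a dependency is in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, name: str, max_concurrent: int, acquire_timeout_s: float):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._acquire_timeout_s = acquire_timeout_s

    @contextmanager
    def slot(self):
        # Wait only a bounded time for a slot; an unbounded wait would move
        # the missing-timeout bug from the network call to the queue.
        if not self._slots.acquire(timeout=self._acquire_timeout_s):
            raise BulkheadFullError(f"{self.name}: no free slot")
        try:
            yield
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow notifier cannot consume
# the capacity that Postgres calls rely on. Values are illustrative.
notifier_bulkhead = Bulkhead("notification-api", max_concurrent=2, acquire_timeout_s=0.05)
```

A caller would wrap each outbound call in `with notifier_bulkhead.slot():` and treat `BulkheadFullError` like any other fast failure, which is what keeps one slow dependency from starving the rest.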
Why It Matters Here (In the Capstone)
Capstones frequently cascade: a slow downstream causes retries, retries saturate the pool, the pool exhausts, the API latency spikes, the SLO burns, and a single dependency has taken down a system that did not need to go down with it. Each of the three patterns breaks the cascade at a different point.
The goal of this concept is to force you to write down the decision per dependency, so the next operator (possibly you, six months later) can tell whether the absence of a breaker is intentional or an oversight. Absence-by-decision and absence-by-oversight are indistinguishable to an examiner unless you wrote it down.
Concrete Example -- from a real capstone
The webhook-handler capstone has three external dependencies:
- Postgres (primary DB)
- Kafka/SQS (queue)
- Notification API (downstream)
Decision table in library/raw/reliability-decisions.md:
| Dependency | Timeout | Retry | Circuit breaker | Degraded mode | Rationale |
|---|---|---|---|---|---|
| Postgres | 2s | Yes: 3 attempts, jittered exponential backoff (50ms, 150ms, 450ms) on OperationalError only | No | No (read path); Yes (write path: buffer to local WAL, drain when DB returns, max 5 min) | Transient DB hiccups are common; long outages are caught by the alert and handled by humans, not by client logic |
| Queue (SQS) | 5s | Yes: SDK retries (default 3) on 5xx/throttle | Yes: open breaker after 10 consecutive failures in 30s; half-open probe every 30s | Yes: if breaker open, return 503 to client with Retry-After -- do not swallow | Queue outage is a user-visible event; no useful degraded mode beyond visible 503 |
| Notification API | 3s | Yes: 2 attempts, 500ms + 1500ms, only on 5xx and timeouts | Yes: open after 20% error rate over 60s sliding window; stay open 60s | Yes: async failure -- mark event notify_pending=true, background job retries with full backoff for 24h | Notifications are important but not blocking for the webhook ACK; asynchrony is the degradation |
What this buys us:
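- A brief Postgres hiccup (a few seconds): retries absorb it, no user impact. A longer blip on the write path, such as a 30-second failover, is covered by the local WAL buffer rather than by retries, since three attempts at a 2s timeout cannot span 30 seconds.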
- A 10-minute notification-API outage: breaker opens fast, webhook still ACKs 200, user sees no error; background retry catches up when the API returns.
- A queue failure: user sees an immediate 503 with a sensible retry hint, not a 90-second timeout.
That is three different decisions for three different dependencies, each with a one-sentence rationale. You cannot defend a single pattern across all dependencies.
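One way to keep the table honest is to encode each row as data and reject rows that omit a timeout or a rationale. The sketch below is an assumption about how that might look, not part of the capstone: `DependencyPolicy` and its field names are hypothetical, and it reads "2 attempts" as two tries in total, so only the first backoff value from the Notification API row is ever used.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DependencyPolicy:
    """One row of the decision table (hypothetical structure)."""

    name: str
    timeout_s: Optional[float]      # None would mean "no timeout"
    max_attempts: int               # total attempts, including the first
    backoffs_s: Tuple[float, ...]   # waits between attempts, before jitter
    breaker: bool
    degraded_mode: str
    rationale: str

    def __post_init__(self):
        if self.timeout_s is None or self.timeout_s <= 0:
            raise ValueError(f"{self.name}: a timeout is required")
        if len(self.backoffs_s) < self.max_attempts - 1:
            raise ValueError(f"{self.name}: missing backoff between attempts")
        if not self.rationale.strip():
            raise ValueError(f"{self.name}: write down the rationale")

    def worst_case_s(self) -> float:
        """Upper bound on time spent in one logical call, ignoring jitter."""
        waits = sum(self.backoffs_s[: self.max_attempts - 1])
        return self.max_attempts * self.timeout_s + waits


# Values copied from the Notification API row of the table.
notifier_policy = DependencyPolicy(
    name="notification-api",
    timeout_s=3.0,
    max_attempts=2,
    backoffs_s=(0.5, 1.5),
    breaker=True,
    degraded_mode="mark notify_pending, background job retries for 24h",
    rationale="not blocking for the webhook ACK",
)
```

Under these assumptions `notifier_policy.worst_case_s()` comes to 2 x 3s plus one 0.5s wait, which is the number to compare against the webhook's latency budget.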
A minimal code shape (Python, using tenacity):
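    from tenacity import (
        retry,
        retry_if_exception_type,
        stop_after_attempt,
        wait_exponential_jitter,
    )

    # `breaker`, `notifier`, and `Notifier5xxError` are assumed to be defined elsewhere.
    @retry(
        stop=stop_after_attempt(2),  # 2 attempts in total, per the table
        wait=wait_exponential_jitter(initial=0.5, max=1.5),  # jittered, capped at 1.5s
        retry=retry_if_exception_type((TimeoutError, Notifier5xxError)),
        reraise=True,  # surface the last error instead of tenacity's RetryError
    )
    def call_notification_api(event):
        # Every attempt passes through the breaker, so it sees each failure.
        with breaker.protect():
            return notifier.send(event, timeout=3)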
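Each attempt runs inside breaker.protect(), so the breaker observes every failure and, once open, short-circuits the retry loop: breaker.OpenError is not in the retryable set and propagates immediately. The caller then falls through to the async-background fallback on breaker.OpenError. Thirty lines total, defensible at a production readiness review (PRR).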
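The fall-through in the caller might look like the sketch below. It is illustrative only: `BreakerOpenError` stands in for breaker.OpenError, `Notifier5xxError` is redefined locally so the block runs on its own, and `send_notification` and `enqueue_retry` are hypothetical injected callables (in the capstone they would be call_notification_api and the background-job enqueue).

```python
class BreakerOpenError(Exception):
    """Stand-in for breaker.OpenError (hypothetical)."""


class Notifier5xxError(Exception):
    """Stand-in for the notifier's 5xx error type (hypothetical)."""


def handle_webhook(event: dict, send_notification, enqueue_retry) -> int:
    """Return the HTTP status for the webhook ACK."""
    try:
        send_notification(event)
    except (BreakerOpenError, Notifier5xxError, TimeoutError):
        # Degraded path: the notification is deferred, not dropped.
        event["notify_pending"] = True
        enqueue_retry(event)
    # The ACK does not depend on the notification succeeding.
    return 200
```

The design choice worth noting is that only the known failure types trigger the degraded path; an unexpected exception still surfaces as an error rather than being silently queued.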
Common Confusion / Misconceptions
"Retry everything; it's free." Retries are not free. They multiply load on a struggling dependency, extend tail latency, and compound when several layers each retry the same failing call. Retry only idempotent operations, only on retryable errors, with a cap and jitter. And never retry a timeout at the same layer without exponential backoff.
"Circuit breakers are complicated; let's not." The minimal breaker is 30 lines of code: a counter of recent failures, an open flag, a timer to probe. It is worth it on any dependency whose failure can trigger retry amplification.
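To make the "counter, open flag, timer" claim concrete, here is a minimal sketch of such a breaker. It is an assumption-laden illustration rather than a production implementation: it counts consecutive failures with no sliding time window, it is not thread-safe, and the class and parameter names are hypothetical. The `clock` parameter exists only so tests can control time.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=10, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._clock = clock
        self._failures = 0       # consecutive failures while closed
        self._opened_at = None   # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.open_seconds:
                raise CircuitOpenError("breaker open; skipping call")
            # Open period elapsed: half-open, let this call through as a probe.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the breaker and resets the counter.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self):
        self._failures += 1
        if self._opened_at is not None or self._failures >= self.failure_threshold:
            # A failed probe, or too many failures in a row: (re)open.
            self._opened_at = self._clock()
```

A real deployment would add a lock and a time window, or use a maintained library, but the state machine itself is no larger than this.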
"Degraded mode means feature flags." Feature flags can implement degraded mode, but degraded mode is a broader idea: "when the ideal path is unavailable, what is the useful-enough path?" Examples: return last-known-good cached data, accept the request and promise delivery later, serve a smaller payload.
"If a dependency goes down, we go down with it." That is the absence of a mitigation, which is a fine choice for some dependencies (your own database, usually) but should be an explicit one. Write "no degraded mode; service returns 503" in the decision table -- do not let it be an oversight.
"Retry counts in the Stripe SDK are good defaults." SDK defaults are designed for the median caller. Your caller is specific. Read the SDK's retry policy and decide whether it suits your budget (some SDKs retry 5 times by default, which under stress can stretch one call to as many as six timeouts plus backoff).
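The cost of stacked retries is simple arithmetic, sketched below. The function name and the example numbers are hypothetical; the point is only that attempts multiply across layers rather than add.

```python
from math import prod
from typing import Sequence


def worst_case_calls(attempts_per_layer: Sequence[int]) -> int:
    """Calls that reach the dependency for one request if every layer
    exhausts its attempts. Each entry is total attempts at that layer."""
    return prod(attempts_per_layer)


# Illustrative: a handler that tries 3 times, wrapping an SDK that makes
# 4 attempts of its own (1 call plus 3 retries).
stacked = worst_case_calls([3, 4])
```

With these numbers one user request can become 12 calls against a dependency that is already struggling, which is why the SDK's retry count belongs in the decision table next to your own.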
How To Use It (In Your Capstone)
- List every external dependency of your capstone (DB, queue, object store, every third-party API).
- Set a timeout on each call. No timeout is a bug; this is the non-negotiable first step.
- For each, write a row: retry policy? breaker? degraded mode? one-sentence rationale.
- Implement the minimal version of each decision in code. Use a library (tenacity, Polly, resilience4j) -- do not roll your own retry unless you must.
- Add an integration test that forces failure in staging (chaos-lite: block the dependency, observe the system) and verifies the decision holds.
- Link the decision table from the runbooks for each dependency.
- Re-review the table whenever a dependency is added, removed, or changed.
See also (integrative)
- S8 M04 Cluster 3 -- failure modes (cascading, correlated, gray): the taxonomy these patterns defend against: ../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-03-reliability-engineering/08-failure-modes-cascading-correlated-gray-primary.md.
- S8 M02 Cluster 4 -- resilience: timeouts, retries, circuit breakers, bulkheads in the service-communication context: ../../../../semester-08-system-design-leadership/module-02-microservices-service-decomposition/concepts/cluster-04-service-communication/12-resilience-timeouts-retries-circuit-breakers-bulkheads-primary.md.
- S8 M04 Cluster 4 -- load shedding, rate limiting, admission control: the intake-side counterparts to these outbound-side mitigations: ../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-04-capacity-planning-and-load/11-load-shedding-rate-limiting-and-admission-control-primary.md.
- S6 M05 Distributed Systems Fundamentals -- timeouts, partial failure, and why naive retry amplifies incidents in distributed systems.
- Microsoft Azure Architecture Center -- Circuit Breaker Pattern -- canonical closed / open / half-open state machine and pattern considerations.
- Microsoft Azure Architecture Center -- Retry Pattern -- decision guide for when retries help and when they hurt.
- AWS Builders' Library -- Timeouts, retries, and backoff with jitter -- jitter math and operational war stories.
- Google SRE Book -- Addressing Cascading Failures -- the single best reference for why these patterns exist.
Check Yourself
- Why is retrying a non-idempotent operation dangerous even when the error appears transient?
- What problem does a circuit breaker solve that retries alone cannot?
- Give one example of degraded mode that does not involve caching.
- Why is setting a timeout a precondition for every other pattern in this table?
- Under what conditions might adding retry actually make your SLO worse rather than better?
- What is the decision-table row for the dependency in your capstone that has no breaker -- is the absence deliberate or an oversight?
Mini Drill or Application (Capstone-scoped)
- Build the dependency table above for your capstone. Be specific about timeouts, attempt counts, and breaker thresholds.
- Pick the dependency with the highest likelihood * impact (often an external API). Implement retry + breaker + one degraded fallback.
- Write an integration test that turns off the dependency in staging and asserts:
  - client retries ≤ the configured max
  - breaker opens within the configured window
  - degraded mode activates and returns the fallback
- Commit the table, code, and test. Link from the SLO doc and the runbook.
- For one dependency, deliberately decide "no retry, no breaker, no degraded mode." Write the rationale in the table. This proves the table is a decision-making artifact, not a checklist.
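The three assertions in the drill can be rehearsed in-process before running them against staging. The sketch below is a stand-in, not a staging test: `FakeDependency` and `Client` are hypothetical, backoff is omitted to keep the test fast, and the breaker counts failed requests rather than using a time window.

```python
class DependencyDown(Exception):
    """Raised by the fake dependency while it is 'turned off'."""


class FakeDependency:
    """Stand-in for the blocked dependency; counts every call it receives."""

    def __init__(self):
        self.calls = 0

    def fetch(self):
        self.calls += 1
        raise DependencyDown("blocked for the test")


class Client:
    """Hypothetical client: bounded retry, simple breaker, fallback."""

    def __init__(self, dep, max_attempts=3, breaker_threshold=2, fallback="cached"):
        self.dep = dep
        self.max_attempts = max_attempts
        self.breaker_threshold = breaker_threshold
        self.fallback = fallback
        self.failed_requests = 0
        self.breaker_open = False

    def get(self):
        if self.breaker_open:
            return self.fallback  # degraded mode; dependency is not called
        for _ in range(self.max_attempts):
            try:
                return self.dep.fetch()
            except DependencyDown:
                continue  # backoff omitted in this sketch
        self.failed_requests += 1
        if self.failed_requests >= self.breaker_threshold:
            self.breaker_open = True
        return self.fallback


def test_dependency_outage():
    dep = FakeDependency()
    client = Client(dep, max_attempts=3, breaker_threshold=2)

    assert client.get() == "cached"           # degraded mode returns the fallback
    assert dep.calls <= client.max_attempts   # retries stay within the cap
    client.get()
    assert client.breaker_open                # breaker opens at the threshold
    calls_before = dep.calls
    assert client.get() == "cached"
    assert dep.calls == calls_before          # open breaker stops calling
```

The staging version replaces `FakeDependency` with a real network block and reads the call count from metrics, but the assertions keep the same shape.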
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.