Reliability and SLO Clinic
Retrieval Prompts
- Define SLI, SLO, and error budget, and state the identity budget = 1 − SLO.
- Give one correct SLI for availability and one for latency, each as a ratio of events.
- State the three failure modes distinguished in this module (cascading, correlated, gray) and give the single-sentence definition of each.
- State the five Chaos Engineering principles from principlesofchaos.org.
- State the two common mitigations for cascading failure at a service boundary.
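The ratio form of an SLI and the budget identity from the first two prompts can be sketched in a few lines of Python. This is a minimal illustration, not a definition from this module: the Request record, the rule that a 5xx status counts as a bad availability event, and the 300 ms latency threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time to serve the request, in milliseconds


def availability_sli(requests: List[Request]) -> Optional[float]:
    """good_events / total_events, where good = no 5xx (assumed rule)."""
    if not requests:
        return None  # no traffic in the window: the ratio is undefined
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: List[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """good_events / total_events, where good = served within threshold_ms."""
    if not requests:
        return None
    good = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return good / len(requests)


def error_budget(slo: float) -> float:
    """The identity from the prompt: budget = 1 - SLO (both as fractions)."""
    return 1.0 - slo


sample = [
    Request(200, 120.0),
    Request(200, 450.0),  # successful but slow
    Request(503, 80.0),   # fast but failed
    Request(200, 250.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Both SLIs share a denominator here for simplicity; a real latency SLI often counts only successful requests, which is a design choice worth stating explicitly when you draft your own.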
Compare and Distinguish
- SLA vs SLO vs SLI
- error budget vs uptime percentage
- cascading failure vs correlated failure
- gray failure vs partial failure
- chaos experiment vs load test
- circuit breaker vs timeout
Common Mistake Check
For each, identify the error:
- "Our SLO is 99.999% because that's what AWS offers."
- "Availability was 99.93% this month, so we're within the 99.9% SLO."
- "We have redundancy across three servers, so correlated failure is impossible."
- "Chaos engineering is just deliberately breaking things in production."
- "The dashboard is green so nothing is wrong."
SLI/SLO Drafting Exercise
Pick an API you know (or one of: user-facing search endpoint, checkout/payment API, file upload). For it, produce:
- Three candidate SLIs in the good_events / total_events form. Cover availability, latency, and one quality signal.
- A recommended SLO for each, with a measurement window (for example, rolling 30 days).
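- The error budget calculation (in minutes over 30 days for availability, as a percentage of requests for latency).
- A burn-rate alert design: at what burn rate, expressed as a multiple of the rate that would exactly exhaust the budget over the window, do you page, and at what burn rate do you open a ticket?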
- One UX consideration that your SLIs miss and how you might capture it.
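For the error budget step, the arithmetic can be checked with a small helper like the one below. It is a sketch under stated assumptions: the function names are hypothetical, the SLO is passed as a fraction, and the example uses a 99.5% target so that your own numbers remain yours to work out.

```python
def _check_slo(slo: float) -> None:
    # A target of exactly 1.0 leaves no budget, so it is rejected here.
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")


def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the window tolerates (availability view)."""
    _check_slo(slo)
    return (1.0 - slo) * window_days * 24 * 60


def budget_requests(slo: float, total_requests: int) -> float:
    """Number of bad events the window tolerates (request-ratio view)."""
    _check_slo(slo)
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the budget still unspent; negative means it is overspent."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_bad = budget_requests(slo, total)
    return 1.0 - (total - good) / allowed_bad


print(round(budget_minutes(0.995), 1))                         # 216.0
print(round(budget_requests(0.995, 1_000_000)))                # 5000
print(round(budget_remaining(0.995, 998_000, 1_000_000), 3))   # 0.6
```

The values are rounded before printing because 1 − SLO is not exactly representable in floating point.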
Failure-Mode Post-hoc Analysis
For each scenario, identify the dominant failure mode (cascading, correlated, or gray) and the single change that most reduces recurrence:
- A single slow downstream dependency causes thread pools to exhaust across all upstream callers; error rate climbs from 0% to 35% in five minutes.
- A bad deploy rolls out to all three replicas of a stateless service simultaneously; all three crash-loop, and the service is down for four minutes.
- The load balancer's health check passes (TCP connect works), but 30% of requests to one backend time out silently at 60s. The green dashboard hides the problem for two hours.
- A DNS misconfiguration affects every pod in a single region at once; no single pod is "broken."
- A Redis cluster upgrade causes a five-minute burst of refused connections on every client; an internal reconnect loop amplifies the load during the outage.
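The third scenario turns on a health check that disagrees with what requests actually experience. The sketch below shows one way to surface that disagreement by comparing the health-check verdict with a request-level success ratio per backend. The function name, the backend names, the 0.99 threshold, and the data are all hypothetical; it illustrates the comparison and is not a recommended detector.

```python
from typing import Dict, List


def silently_degraded_backends(
    health_ok: Dict[str, bool],
    outcomes: Dict[str, List[bool]],
    min_success_ratio: float = 0.99,  # assumed threshold, tune per service
) -> List[str]:
    """Backends whose health check passes while real requests are failing."""
    suspects = []
    for backend, results in outcomes.items():
        if not results:
            continue  # no traffic observed, nothing to compare against
        success_ratio = sum(results) / len(results)
        if health_ok.get(backend, False) and success_ratio < min_success_ratio:
            suspects.append(backend)
    return sorted(suspects)


health_ok = {"backend-a": True, "backend-b": True, "backend-c": False}
outcomes = {
    "backend-a": [True] * 100,
    "backend-b": [True] * 70 + [False] * 30,  # 30% of requests time out
    "backend-c": [False] * 100,               # down, and the check says so
}
print(silently_degraded_backends(health_ok, outcomes))  # ['backend-b']
```

Note that backend-c is not reported: its failure is already visible to the health check, so the two signals agree.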
Chaos Experiment Design
Design, on paper, one chaos experiment for each of these:
- Network partition: partition a service from its database for 30 seconds.
- Latency injection: inject 500ms of latency into 5% of requests to the recommendation service.
- Instance termination: kill one of three instances of a stateful service.
For each, specify: (a) hypothesis about steady state, (b) exact blast radius, (c) abort conditions, (d) expected signals to validate hypothesis, (e) rollback procedure.
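One way to keep parts (a) through (e) honest is to write each design as a structured record and check it for gaps before anyone runs it. The sketch below assumes a hypothetical ChaosExperiment class; the field names mirror the five parts above, while the checkout metrics, limits, and wording in the example are placeholders invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class ChaosExperiment:
    hypothesis: str                     # (a) steady-state hypothesis
    blast_radius: str                   # (b) exactly what is affected
    abort_conditions: Dict[str, float]  # (c) metric name -> max tolerated value
    expected_signals: List[str]         # (d) what should confirm the hypothesis
    rollback: str                       # (e) how to undo the injected fault

    def missing_parts(self) -> List[str]:
        """Names of parts (a)-(e) that were left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def should_abort(self, observed: Dict[str, float]) -> bool:
        """True if any abort metric exceeds its limit or is not reported."""
        return any(
            observed.get(metric, float("inf")) > limit
            for metric, limit in self.abort_conditions.items()
        )


# Hypothetical draft of the latency-injection experiment, rollback not yet written.
draft = ChaosExperiment(
    hypothesis="Checkout p99 stays under 800 ms while recommendations are slow",
    blast_radius="5% of requests to the recommendation service",
    abort_conditions={"checkout_error_rate": 0.01, "checkout_p99_ms": 800.0},
    expected_signals=["recommendation timeouts rise", "checkout p99 stays flat"],
    rollback="",
)
print(draft.missing_parts())  # ['rollback']
print(draft.should_abort({"checkout_error_rate": 0.002, "checkout_p99_ms": 640.0}))  # False
print(draft.should_abort({"checkout_error_rate": 0.002}))  # True
```

Treating an unreported metric as a reason to abort is a deliberate fail-safe choice in this sketch: if telemetry disappears mid-experiment, you can no longer tell whether the hypothesis holds.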
Burn-Rate Alert Math
You have an SLO of 99.9% availability over 30 days. That's 43.2 minutes of budget.
- You burn the entire budget in 1 hour. What is the burn rate, expressed either in 30-day budgets per hour or as a multiple of the rate that would spend the budget in exactly 30 days?
- Design a two-level alert: fast (page in minutes) and slow (ticket in hours). What burn-rate thresholds and windows would you use? Show the math.
- If the SLO is 99.99% instead (4.32 minutes of budget), how does your alert window change?
- Why does a slow burn over a week deserve a response even if it never "violates" the SLO this month?
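To check your math, the relationships among error rate, burn rate, and time to exhaustion can be sketched as below. The function names are hypothetical, burn rate is taken to mean the observed error rate divided by the error budget (1 − SLO), and the example uses a 99.5% SLO and made-up alert parameters so that the questions above are left for you to answer.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate as a multiple of the rate the SLO allows."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("SLO must be below 1.0 to leave a budget")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is spent at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")  # nothing is being spent
    return window_hours / rate


def threshold_for(budget_fraction: float, alert_window_hours: float,
                  slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that spends budget_fraction of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


# 2% of requests failing against a 99.5% SLO.
rate = burn_rate(0.02, 0.995)
print(round(rate, 2))                       # 4.0
print(round(hours_to_exhaustion(rate), 1))  # 180.0

# Hypothetical alert: fire when 10% of the budget would be spent in 6 hours.
print(round(threshold_for(0.10, 6), 2))     # 12.0
```

A burn rate of 1.0 spends the budget in exactly one window; anything above 1.0, if sustained, exhausts it early, which is the quantity the slow-burn question asks you to reason about.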
Evidence Check
This practice page is complete only if you can:
- Draft a well-formed SLI for a real service in under 5 minutes.
- Compute an error budget and a burn-rate alert threshold from the SLO.
- Classify a real incident into cascading/correlated/gray and name its dominant mitigation.
- Design a chaos experiment with explicit hypothesis, blast radius, and abort criteria.