Reliability and SLO Clinic

Retrieval Prompts

Define SLI, SLO, and error budget, and state the identity error budget = 1 − SLO.
Give one correct SLI for availability and one for latency, each as a ratio of events.
State the three failure modes distinguished in this module (cascading, correlated, gray) and give the single-sentence definition of each.
State the five Chaos Engineering principles from principlesofchaos.org.
State the two common mitigations for cascading failure at a service boundary.
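
As a self-check for the first two prompts, the ratio form of an SLI and the budget identity can be sketched in a few lines of Python. The `Request` record, its field names, the "status below 500" definition of a good event, and the 300 ms latency threshold are all illustrative assumptions, not definitions taken from this module.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical event record; field names are assumptions for illustration.
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests):
    """good_events / total_events, where good = not a server error (status < 500)."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """good_events / total_events, where good = served faster than threshold_ms."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


def error_budget(slo):
    """The identity from the prompt: budget = 1 - SLO, both as fractions."""
    return 1.0 - slo


sample = [Request(200, 120.0), Request(200, 450.0), Request(503, 30.0), Request(404, 80.0)]
print(availability_sli(sample))        # 3 of 4 requests are not 5xx
print(latency_sli(sample))             # 3 of 4 requests finish under 300 ms
print(round(error_budget(0.999), 6))   # 0.001
```

Whether a 404 counts as a good event is a design decision; this sketch treats client errors as good because the service answered correctly.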

Compare and Distinguish

SLA vs SLO vs SLI
error budget vs uptime percentage
cascading failure vs correlated failure
gray failure vs partial failure
chaos experiment vs load test
circuit breaker vs timeout
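
For the last pair, it can help to see the difference in code: a timeout bounds how long one call may take, while a circuit breaker uses recent call history to stop calling at all for a while. The class below is a minimal sketch under assumed parameters (`failure_threshold`, `reset_after_s`) and is not modeled on any particular library.

```python
import time


class CircuitBreaker:
    """Minimal sketch: open after N consecutive failures, retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock      # injectable so the behavior can be tested without sleeping
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast instead of tying up a thread on a sick dependency.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call re-opens the circuit immediately.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In practice the two are combined: `fn` would itself enforce a per-call timeout, so that slow calls count as failures and eventually open the breaker.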

Common Mistake Check

For each, identify the error:

"Our SLO is 99.999% because that's what AWS offers."
"Availability was 99.93% this month, so we're within the 99.9% SLO."
"We have redundancy across three servers, so correlated failure is impossible."
"Chaos engineering is just deliberately breaking things in production."
"The dashboard is green so nothing is wrong."

SLI/SLO Drafting Exercise

Pick an API you know (or one of: user-facing search endpoint, checkout/payment API, file upload). For it, produce:

Three candidate SLIs in the good_events / total_events form. Cover availability, latency, and one quality signal.
A recommended SLO for each, with a measurement window (for example, rolling 30 days).
The error budget calculation (in minutes over 30 days for availability, as a percentage of requests for latency).
A burn-rate alert design: what burn rate (as a multiple of the rate that would exactly exhaust the budget over the window) triggers a page vs a ticket?
One UX consideration that your SLIs miss and how you might capture it.
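
To check the arithmetic in the budget step, a small helper like the following may be useful. It is a sketch only: the function names are hypothetical, and the 99.95% and 99% targets and the request count in the usage lines are example inputs, not recommendations.

```python
def availability_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def latency_budget_requests(slo, total_requests):
    """Number of requests allowed to miss the latency threshold."""
    return (1.0 - slo) * total_requests


print(round(availability_budget_minutes(0.999), 2))     # 43.2
print(round(availability_budget_minutes(0.9995), 2))    # 21.6
print(round(latency_budget_requests(0.99, 2_000_000)))  # 20000
```

The availability figure assumes downtime is total; with partial outages, the same budget is better tracked as a count of failed requests.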

Failure-Mode Post-hoc Analysis

For each scenario, identify the dominant failure mode (cascading, correlated, or gray) and the single change that most reduces recurrence:

A single slow downstream dependency causes thread pools to exhaust across all upstream callers; error rate climbs from 0% to 35% in five minutes.
A bad build is rolled out to all three replicas of a stateless service simultaneously; all three crash-loop; the service is down for four minutes.
The load balancer's health check passes (TCP connect works), but 30% of requests to one backend time out silently at 60s. The green dashboard hides the problem for two hours.
A DNS misconfiguration affects every pod in a single region at once; no single pod is "broken."
A Redis cluster upgrade triggers a five-minute burst of connection-refused errors on every client; an internal reconnect loop amplifies the load during the outage.
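
For the reconnect-loop scenario above, one commonly used change is to replace tight retry loops with capped exponential backoff plus jitter, so that clients spread out their reconnects instead of retrying in lockstep. The sketch below is one possible form; `base_s`, `cap_s`, and the injectable `rng` parameter are illustrative assumptions, and it is not presented as the expected answer to the exercise.

```python
import random


def backoff_delays(attempts, base_s=0.5, cap_s=30.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) so clients desynchronize.
        yield rng() * ceiling


# With the jitter pinned to its maximum, the ceilings themselves are visible.
print(list(backoff_delays(8, rng=lambda: 1.0)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```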

Chaos Experiment Design

Design, on paper, one chaos experiment for each of these:

Network partition: partition a service from its database for 30 seconds.
Latency injection: inject 500ms of latency into 5% of requests to the recommendation service.
Instance termination: kill one of three instances of a stateful service.

For each, specify: (a) hypothesis about steady state, (b) exact blast radius, (c) abort conditions, (d) expected signals to validate hypothesis, (e) rollback procedure.
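
Items (a) through (e) can be written down as a small structured record, which makes it harder to skip one. The following is a sketch under stated assumptions: the `ChaosExperiment` class, the metric names, and every threshold are hypothetical placeholders; only the 500 ms and 5% figures come from the latency-injection prompt above.

```python
from dataclasses import dataclass


@dataclass
class ChaosExperiment:
    """Paper design for one experiment; fields map to items (a) through (e)."""
    name: str
    hypothesis: str          # (a) steady-state hypothesis
    blast_radius: str        # (b) exactly what is exposed
    abort_conditions: dict   # (c) metric name -> maximum tolerated value
    expected_signals: list   # (d) what you expect to observe
    rollback: str            # (e) how to undo the injection

    def breached(self, observed):
        """Return the abort metrics whose observed value exceeds its limit.

        A metric missing from `observed` is treated as not breached here;
        a stricter design might abort when a signal disappears.
        """
        return [name for name, limit in self.abort_conditions.items()
                if observed.get(name, 0.0) > limit]


# All thresholds and metric names below are illustrative placeholders.
latency_injection = ChaosExperiment(
    name="recommendation-latency",
    hypothesis="Page error rate stays below 1% while recommendations are slow",
    blast_radius="5% of requests to the recommendation service, 500 ms added",
    abort_conditions={"error_rate": 0.01, "p99_latency_ms": 1200.0},
    expected_signals=["recommendation p99 rises", "fallback responses increase"],
    rollback="Disable the latency injection rule",
)

print(latency_injection.breached({"error_rate": 0.02, "p99_latency_ms": 900.0}))
# ['error_rate']
```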

Burn-Rate Alert Math

You have an SLO of 99.9% availability over 30 days. That's 43.2 minutes of budget.

You burn the entire budget in 1 hour. What is the burn rate, expressed as a multiple of the rate that would exactly exhaust the budget over the 30-day window (equivalently, "months of budget per hour")?
Design a two-level alert: fast (page in minutes) and slow (ticket in hours). What burn-rate thresholds and windows would you use? Show the math.
If the SLO is 99.99% instead (4.32 minutes of budget), how does your alert window change?
Why does a slow burn over a week deserve a response even if it never "violates" the SLO this month?
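
To verify your burn-rate arithmetic, the relationships can be sketched as below. Burn rate is taken here as the share of budget consumed divided by the share of the window elapsed. The "2% of budget in 1 hour" and "5% of budget in 6 hours" figures are a commonly cited starting point (for example in Google's SRE Workbook), used as assumptions rather than as this module's expected answer.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(budget_fraction, elapsed_hours, window_hours=WINDOW_HOURS):
    """Multiple of the sustainable rate: budget share consumed over window share elapsed."""
    return budget_fraction * window_hours / elapsed_hours


def error_rate_threshold(rate, slo):
    """Observed error ratio that corresponds to a given burn rate."""
    return rate * (1.0 - slo)


print(burn_rate(1.0, 1))                             # whole budget in 1 hour: 720.0
print(round(burn_rate(0.02, 1), 2))                  # fast alert: 14.4
print(round(burn_rate(0.05, 6), 2))                  # slow alert: 6.0
print(round(error_rate_threshold(14.4, 0.999), 5))   # 0.0144
print(round(error_rate_threshold(14.4, 0.9999), 5))  # 0.00144
```

Note that the burn-rate multiples do not depend on the SLO; what changes between 99.9% and 99.99% is the error ratio each multiple corresponds to, which is ten times smaller for the stricter target.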

Evidence Check

This practice page is complete only if you can:

Draft a well-formed SLI for a real service in under 5 minutes.
Compute an error budget and a burn-rate alert threshold from the SLO.
Classify a real incident into cascading/correlated/gray and name its dominant mitigation.
Design a chaos experiment with explicit hypothesis, blast radius, and abort criteria.
