Reliability and SLO Clinic

Retrieval Prompts

  1. Define SLI, SLO, and error budget, and state the identity error budget = 1 − SLO.
  2. Give one correct SLI for availability and one for latency, each as a ratio of events.
  3. State the three failure modes distinguished in this module (cascading, correlated, gray) and give the single-sentence definition of each.
  4. State the five Chaos Engineering principles from principlesofchaos.org.
  5. State the two common mitigations for cascading failure at a service boundary.

Compare and Distinguish

  • SLA vs SLO vs SLI
  • error budget vs uptime percentage
  • cascading failure vs correlated failure
  • gray failure vs partial failure
  • chaos experiment vs load test
  • circuit breaker vs timeout
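
To make the last pair concrete, here is a minimal sketch of a circuit breaker wrapped around a call that is assumed to enforce its own timeout. The class name `CircuitBreaker`, its parameters, and the `CircuitOpenError` exception are hypothetical names chosen for illustration, not part of this module or of any particular library, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (assumed policy)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Open: fail fast instead of tying up a thread on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            # The operation is assumed to apply its own timeout and raise on expiry.
            result = operation()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

The contrast to draw: a timeout bounds how long a single call can hold a resource, while the breaker remembers recent outcomes and stops sending further calls at all until the cooldown passes.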

Common Mistake Check

For each, identify the error:

  1. "Our SLO is 99.999% because that's what AWS offers."
  2. "Availability was 99.93% this month, so we're within the 99.9% SLO."
  3. "We have redundancy across three servers, so correlated failure is impossible."
  4. "Chaos engineering is just deliberately breaking things in production."
  5. "The dashboard is green so nothing is wrong."

SLI/SLO Drafting Exercise

Pick an API you know (or one of: user-facing search endpoint, checkout/payment API, file upload). For it, produce:

  1. Three candidate SLIs in the good_events / total_events form. Cover availability, latency, and one quality signal.
  2. A recommended SLO for each, with a measurement window (for example, rolling 30 days).
  3. The error budget calculation (in minutes over 30 days for availability, as a percentage of requests for latency).
  4. A burn-rate alert design: at what burn-rate multiple do you page, and at what multiple do you open a ticket?
  5. One UX consideration that your SLIs miss and how you might capture it.
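
As a starting point for items 1 to 3, the arithmetic can be sketched in a few lines. The function names below are hypothetical helpers, and treating a window with zero traffic as fully compliant is an assumption you may want to change for your service.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI in the good_events / total_events form."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as meeting the objective.
        return 1.0
    return good_events / total_events


def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction of events: budget = 1 - SLO."""
    return 1.0 - slo


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based budget for an availability SLO over the measurement window."""
    return error_budget_fraction(slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the event-based budget still unspent (negative if overspent)."""
    allowed_bad = error_budget_fraction(slo) * total_events
    if allowed_bad == 0:
        # A 100% SLO, or no traffic, leaves no budget to measure against.
        return 0.0
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# A 99.9% availability SLO over 30 days allows about 43.2 minutes of downtime.
availability_budget = error_budget_minutes(0.999)

# A latency SLO of "99% of requests under the threshold" leaves a 1% budget.
latency_budget = error_budget_fraction(0.99)
```

For example, with a 99.9% SLO and 500 bad requests out of 1,000,000, `budget_remaining` comes out at roughly one half: the window allows about 1,000 bad requests, and half of them are spent.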

Failure-Mode Post-hoc Analysis

For each scenario, identify the dominant failure mode (cascading, correlated, or gray) and the single change that most reduces recurrence:

  1. A single slow downstream dependency causes thread pools to exhaust across all upstream callers; error rate climbs from 0% to 35% in five minutes.
  2. A bad build is rolled out to all three replicas of a stateless service simultaneously; all three crash-loop; the service is down for four minutes.
  3. The load balancer's health check passes (TCP connect works), but 30% of requests to one backend time out silently at 60s. The green dashboard hides the problem for two hours.
  4. A DNS misconfiguration affects every pod in a single region at once; no single pod is "broken."
  5. A Redis cluster upgrade triggers a five-minute burst of refused connections on every client; an internal reconnect loop amplifies the load during the outage.

Chaos Experiment Design

Design, on paper, one chaos experiment for each of these:

  1. Network partition: partition a service from its database for 30 seconds.
  2. Latency injection: inject 500ms of latency into 5% of requests to the recommendation service.
  3. Instance termination: kill one of three instances of a stateful service.

For each, specify: (a) the steady-state hypothesis, (b) the exact blast radius, (c) the abort conditions, (d) the signals you expect to see if the hypothesis holds, and (e) the rollback procedure.
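
One way to keep a paper design honest is to record it as a structured plan and check that none of the five parts was left blank. The sketch below is only a template: the `ChaosExperimentPlan` class and its field names are hypothetical, and the example values describe an invented scenario, not one of the three exercises.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChaosExperimentPlan:
    """Paper template mirroring parts (a) to (e); all names are illustrative."""

    name: str
    steady_state_hypothesis: str   # (a) what "normal" looks like, in measurable terms
    blast_radius: str              # (b) exactly which hosts, users, or traffic are exposed
    abort_conditions: List[str]    # (c) signals that stop the experiment immediately
    expected_signals: List[str]    # (d) what you expect to observe if the hypothesis holds
    rollback_procedure: str        # (e) how to restore the system to its prior state

    def missing_fields(self) -> List[str]:
        """Names of any parts left empty; an empty string or list counts as missing."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft with the rollback step not yet written.
draft = ChaosExperimentPlan(
    name="cpu-stress-on-one-cache-node",
    steady_state_hypothesis="p99 read latency stays under 50 ms",
    blast_radius="one cache node in the staging cluster",
    abort_conditions=["p99 read latency above 200 ms for 1 minute"],
    expected_signals=["traffic shifts to the remaining nodes"],
    rollback_procedure="",
)

print(draft.missing_fields())
```

A plan is ready for review only when `missing_fields()` returns an empty list; the check says nothing about whether the content of each field is any good.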

Burn-Rate Alert Math

You have an SLO of 99.9% availability over 30 days. That's 43.2 minutes of budget.

  1. You burn the entire budget in 1 hour. What is the burn rate, expressed as a multiple of the rate that would spend the budget in exactly 30 days (or, equivalently, in "months of budget per hour")?
  2. Design a two-level alert: fast (page in minutes) and slow (ticket in hours). What burn-rate thresholds and windows would you use? Show the math.
  3. If the SLO is 99.99% instead (4.32 minutes of budget), how does your alert window change?
  4. Why does a slow burn over a week deserve a response even if it never "violates" the SLO this month?
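
The relationships behind these questions can be sketched as below. The helper names are hypothetical, and the two example thresholds (2% of the budget in 1 hour, 5% in 6 hours) are the starting points commonly cited from Google's SRE Workbook, used here as an assumption rather than as this module's required answer.

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window, i.e. 720 hours


def burn_rate(budget_fraction_spent: float, elapsed_hours: float,
              window_hours: float = WINDOW_HOURS) -> float:
    """Multiple of the 'exactly on budget' rate: 1.0 spends the budget over the full window."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return budget_fraction_spent * window_hours / elapsed_hours


def alert_error_rate(slo: float, rate: float) -> float:
    """Error ratio over the alert window that corresponds to a given burn rate."""
    return rate * (1.0 - slo)


def hours_until_exhausted(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Time to spend a full budget if the current burn rate holds."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_hours / rate


# Assumed thresholds: page when 2% of the budget is spent within 1 hour,
# open a ticket when 5% of the budget is spent within 6 hours.
page_rate = burn_rate(0.02, elapsed_hours=1)    # about 14.4
ticket_rate = burn_rate(0.05, elapsed_hours=6)  # about 6

# For a 99.9% SLO, the paging threshold corresponds to an error ratio of about 1.44%.
page_error_ratio = alert_error_rate(0.999, page_rate)
```

Note that the burn-rate multiple depends only on the fraction of budget spent and the elapsed time, while the error ratio that triggers the alert scales with 1 − SLO, which is where a tighter SLO changes the design.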

Evidence Check

This practice page is complete only if you can:

  • Draft a well-formed SLI for a real service in under 5 minutes.
  • Compute an error budget and a burn-rate alert threshold from the SLO.
  • Classify a real incident into cascading/correlated/gray and name its dominant mitigation.
  • Design a chaos experiment with explicit hypothesis, blast radius, and abort criteria.