SLIs, SLOs, and Error Budgets

What This Concept Is

Three concepts from the Google SRE book form the contract between reliability and velocity.

  • SLI (Service Level Indicator): a measurable quantity that describes one aspect of service quality, usually written as a ratio of good events to valid events.
  • SLO (Service Level Objective): a numeric target for an SLI, measured over a time window, that the team commits to meet.
  • Error budget: 1 - SLO. The fraction of valid events you are allowed to handle badly within the window before the SLO is breached.

A realistic SLI expression for an HTTP API:

SLI_availability = count(requests with status in {200, 201, 202, 204}
AND latency < 500 ms) / count(valid requests)
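As a minimal sketch, assuming each request is available as a record with a status code, a latency, and a flag saying whether it counts as valid, the ratio above could be computed as follows. The Request type, the valid flag, and the sample data are hypothetical; only the status set and the 500 ms threshold come from the expression above.

```python
from dataclasses import dataclass

GOOD_STATUSES = {200, 201, 202, 204}   # from the SLI expression
LATENCY_THRESHOLD_MS = 500.0           # from the SLI expression


@dataclass(frozen=True)
class Request:
    status: int
    latency_ms: float
    valid: bool = True  # hypothetical flag: False for traffic excluded from the SLI


def is_good(req: Request) -> bool:
    """A request is good only if it succeeded AND was fast enough."""
    return req.status in GOOD_STATUSES and req.latency_ms < LATENCY_THRESHOLD_MS


def sli(requests: list[Request]) -> float:
    """Ratio of good requests to valid requests."""
    valid = [r for r in requests if r.valid]
    if not valid:
        raise ValueError("no valid requests in the window; the SLI is undefined")
    good = sum(1 for r in valid if is_good(r))
    return good / len(valid)


sample = [
    Request(200, 120.0),
    Request(201, 480.0),
    Request(200, 900.0),              # succeeded but too slow: bad
    Request(503, 30.0),               # server error: bad
    Request(200, 50.0, valid=False),  # excluded from the denominator (illustrative)
]
print(sli(sample))  # 2 good / 4 valid = 0.5
```

The point of the valid flag is that the denominator is a design decision: anything excluded there can never count against the service.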

A realistic SLO built on top:

Over any 30-day rolling window, SLI_availability >= 99.9%.

And the corresponding error budget:

Error budget = 1 - 0.999 = 0.001 = 0.1% of valid requests in the window.

If you got 100,000,000 valid requests in the window, you are allowed exactly 100,000 bad requests. Once you have spent 100,000, you are out of budget - and the team has an explicit policy for what happens next.
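The arithmetic above can be sketched in a few lines. This is an illustration rather than a prescribed implementation: the function names are hypothetical, and the SLO is passed as a string so that exact fractions can be used instead of binary floats.

```python
from fractions import Fraction


def allowed_bad_events(slo: str, valid_events: int) -> int:
    """Number of bad events the error budget permits, rounded down."""
    budget_fraction = 1 - Fraction(slo)        # e.g. 1 - 0.999 = 1/1000
    return int(budget_fraction * valid_events)


def remaining_budget(slo: str, valid_events: int, bad_events: int) -> int:
    """Bad events still available; zero or negative means the budget is gone."""
    return allowed_bad_events(slo, valid_events) - bad_events


print(allowed_bad_events("0.999", 100_000_000))        # 100000
print(remaining_budget("0.999", 100_000_000, 40_000))  # 60000
```

Fraction is used because 1 - 0.999 is not exactly 0.001 in floating point, which can shift the rounded result by one event for some inputs.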

Why It Matters Here

SLOs turn the vague word "reliable" into an engineering quantity. The error budget turns that quantity into a negotiation with product management: as long as the budget is healthy, you ship. When the budget burns fast, you freeze feature work and fix reliability.

This is the core operating contract of modern SRE. Without it:

  • Reliability is an always-increasing demand with no limit.
  • Every outage produces an emotional overreaction instead of a calibrated response.
  • Engineering and product argue about priorities instead of measuring against a shared number.

The SRE Workbook is explicit: the error budget is how much unreliability you can afford, and the policy is what you do when you spend it.

Concrete Example

A payment API has two candidate SLIs.

  • SLI A (availability ratio): count(2xx responses) / count(all responses except client-error 4xx).
  • SLI B (good-request ratio): count(2xx responses within 300 ms) / count(valid requests).

SLI B is stronger because it folds latency into the definition of "good." A request that succeeded but took 5 seconds is not a good payment experience.
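To make the difference concrete, here is a small sketch that scores the same traffic under both definitions. It assumes that SLI A excludes client-error (4xx) responses from its denominator and that "valid requests" for SLI B means the same thing; the Response type and the sample traffic are hypothetical.

```python
from typing import NamedTuple


class Response(NamedTuple):
    status: int
    latency_ms: float


def is_2xx(r: Response) -> bool:
    return 200 <= r.status < 300


def is_client_error(r: Response) -> bool:
    return 400 <= r.status < 500


def sli_a(responses: list[Response]) -> float:
    """Availability ratio: 2xx over everything that is not a client error."""
    counted = [r for r in responses if not is_client_error(r)]
    return sum(is_2xx(r) for r in counted) / len(counted)


def sli_b(responses: list[Response], threshold_ms: float = 300.0) -> float:
    """Good-request ratio: 2xx AND within the latency threshold, over valid requests."""
    valid = [r for r in responses if not is_client_error(r)]  # assumed meaning of "valid"
    good = sum(is_2xx(r) and r.latency_ms <= threshold_ms for r in valid)
    return good / len(valid)


traffic = (
    [Response(200, 80.0)] * 8      # fast successes
    + [Response(200, 5000.0)]      # succeeded, but took 5 seconds
    + [Response(503, 40.0)]        # server error
    + [Response(400, 10.0), Response(404, 12.0)]  # client errors, excluded
)
print(sli_a(traffic))  # 9 / 10 = 0.9
print(sli_b(traffic))  # 8 / 10 = 0.8
```

The 5-second payment counts as good under SLI A and as bad under SLI B, which is the whole difference between the two.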

Pick SLO: 99.9% over 28 days. Error budget: 0.1% of valid requests.

Suppose in 28 days the API handles 50,000,000 valid requests.

  • Error budget: 50,000,000 * 0.001 = 50,000 bad requests allowed.
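  • In the first 10 days, a partial outage produces 30,000 bad requests. Budget spent: 30,000 / 50,000 = 60%, with only 10 / 28 ≈ 36% of the window elapsed. Average burn rate: 0.60 / 0.36 ≈ 1.7x (at the same average pace of 3,000 bad requests per day, the remaining 20,000 would be gone in about 7 more days).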
  • The policy kicks in when 50% or more of the budget is spent roughly a third of the way into the window: freeze non-urgent feature launches, redirect engineering to reliability work, and review the incident's root cause before shipping again.

At the end of 28 days, if total bad requests <= 50,000, the SLO is met. Reset and start again.
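The numbers in this example can be checked with a short sketch. It assumes the window's total traffic (50,000,000) is known or forecast up front, and it defines burn rate as the fraction of budget spent divided by the fraction of the window elapsed. That makes it an average over the elapsed days, which is lower than the rate during the outage itself. The function name and return keys are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def burn_summary(valid_requests: int, bad_requests: int, slo: str,
                 days_elapsed: int, window_days: int) -> dict:
    budget = (1 - Fraction(slo)) * valid_requests    # bad requests allowed in the window
    spent = bad_requests / budget                    # fraction of the budget consumed
    elapsed = Fraction(days_elapsed, window_days)    # fraction of the window elapsed
    burn_rate = spent / elapsed                      # 1.0 = on pace to spend exactly the budget
    daily_bad = Fraction(bad_requests, days_elapsed)

    days_to_exhaustion: Optional[float] = None       # None if nothing is burning
    if daily_bad > 0:
        days_to_exhaustion = max(0.0, float((budget - bad_requests) / daily_bad))

    return {
        "budget": int(budget),
        "spent": float(spent),
        "burn_rate": float(burn_rate),
        "days_to_exhaustion": days_to_exhaustion,
    }


summary = burn_summary(50_000_000, 30_000, "0.999", days_elapsed=10, window_days=28)
print(summary["budget"])                        # 50000
print(summary["spent"])                         # 0.6
print(round(summary["burn_rate"], 2))           # 1.68
print(round(summary["days_to_exhaustion"], 1))  # 6.7

# End-of-window check: the SLO is met if total bad requests stay within the budget.
print(30_000 <= summary["budget"])              # True
```

A burn rate above 1.0 means the budget will run out before the window ends if nothing changes; here that happens around day 17 of 28.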

Common Confusion / Misconception

"We will target 100% availability." 100% is the wrong target. Nothing is 100%: networks blip, kernels panic, dependencies rotate. A widely repeated rule of thumb, consistent with the SRE book's point that reliability costs grow faster than linearly, is that each additional nine costs roughly 10x: moving from 99.9% to 99.99% is about 10x, and moving from 99.99% to 99.999% is another 10x, while users often cannot perceive the difference. Choose a number that reflects what your users actually experience, and leave room to ship.

"The SLO is an aspiration." The SLO is a promise with teeth, backed by the error-budget policy. If breaching the SLO does not change behavior, you have a vanity metric, not an SLO.

"We will alert when the SLO is broken." Too late. You alert on error budget burn rate - multi-window, multi-burn-rate alerts are the standard pattern (fast burn triggers in minutes; slow burn triggers over hours). The SLO is the post-window assessment; the burn-rate alerts are how you stay ahead of it.

How To Use It

  1. For each user journey, pick one or two SLIs. Start with availability and latency; do not drown in indicators.
  2. Write each SLI as a ratio of good events to valid events. Define what "valid" means - this is where most SLIs break (e.g., are client-error 4xx responses "bad" from the service's perspective? Usually no).
  3. Choose an SLO you can afford to miss sometimes. Look at the last quarter's data. Pick a target that would have been achievable.
  4. Write the error-budget policy: what happens at 50% burn, 75% burn, 100% burn. Who decides. What freezes.
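  5. Instrument multi-window burn-rate alerts (e.g., page if the 1h burn rate exceeds 14.4x or the 6h burn rate exceeds 6x; these numbers come from the SRE Workbook and assume a 30-day window, where they correspond to spending 2% of the budget in 1 hour and 5% in 6 hours).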
  6. Review the SLO quarterly. Tighten if you are consistently over-achieving; loosen if the target is fantasy.
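Step 5 can be sketched as a pure function over pre-computed error ratios. This is an illustration, not a production alerting rule: it assumes the error ratio for each lookback window is already available (for example from a metrics query), and it pairs each long window with a shorter confirmation window (5m for 1h, 30m for 6h) as the SRE Workbook suggests. The BurnRule type, the dictionary keys, and the sample ratios are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_window: str    # shows the burn is sustained
    short_window: str   # shows the burn is still happening now
    threshold: float    # burn-rate multiple that triggers a page


RULES = [
    BurnRule("1h", "5m", 14.4),   # fast burn
    BurnRule("6h", "30m", 6.0),   # slow burn
]


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)


def firing_rules(error_ratios: dict[str, float], slo: float) -> list[BurnRule]:
    """A rule fires only when BOTH of its windows exceed the threshold."""
    return [
        rule for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo) > rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo) > rule.threshold
    ]


# Hypothetical error ratios (bad / valid) per lookback window for a 99.9% SLO.
observed = {"5m": 0.020, "30m": 0.020, "1h": 0.016, "6h": 0.004}
for rule in firing_rules(observed, slo=0.999):
    print(f"page: {rule.long_window}/{rule.short_window} burn above {rule.threshold}x")
# Only the 1h/5m rule fires: the 1h burn is about 16x, while the 6h burn is about 4x.
```

Requiring the short window to agree means the page clears soon after the problem stops, instead of staying red until the long window drains.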

Check Yourself

  1. Why is availability = uptime / window a weaker SLI than good requests / valid requests for an API?
  2. A 99.99% SLO over 30 days allows roughly how many seconds of downtime?
  3. What does "multi-burn-rate alerting" buy you that a single threshold does not?
  4. A team sets their SLO at 99.99% because their CEO said they must. The last 90 days achieved 99.92%. What happens in practice, and what's the right fix?
  5. Name one scenario where you would tighten an SLO and one where you would loosen it, citing the historical data that would justify each.
  6. Why is "the error budget is spent, so feature freeze" not a universally correct response, and what is the nuance?

Mini Drill or Application

Pick a real API (internal or public) you use daily. Write: one SLI as a ratio; one SLO with a window and numeric target; the arithmetic for the monthly error budget given typical traffic; and one sentence of error-budget policy. If you cannot estimate traffic, pick a plausible number and state the assumption.

Transfer / Where This Shows Up Later

SLOs are the single highest-leverage artifact in reliability engineering. Once you have one, every downstream concept has something to bite on.

  • This module, concept 08 (failure modes): every cascading/correlated/gray failure becomes legible as "that outage consumed X% of the error budget."
  • This module, concepts 10-11 (queueing, load shedding): the error-budget policy is where you justify load-shedding; sheds protect the budget.
  • This module, concept 14 (incident lifecycle): SLO-burn alerts are the recommended symptom-based paging signal. Without an SLO, you are paging on causes.
  • S8 M5 (leadership): SLOs are the contract between SRE and product. Without an error budget, "move fast" and "stay up" are unbounded and competing; with one, they are the same conversation.
  • S9 M5 (observability): your SLI is an aggregate of metrics; shipping per-service histograms is how you actually compute it. No histograms, no SLI.
  • S10 M4 (operational readiness): a capstone-grade readiness review asks for your SLO document, your policy, and your burn-rate alerting. Anything less is a "not ready" signal.

SLOs also transfer culturally. Teams that run on SLOs argue less; teams that run on dashboards argue constantly. The SLO framing takes the emotion out of "is this reliable enough."
