Error Budget for a Capstone: Small but Real

What This Concept Is

An error budget is the absolute number of bad events you are allowed to emit in a window before you have broken your SLO. It is just 1 - SLO in percentage terms, translated into the units you actually count: failed requests, late requests, stale reads, corrupted outputs -- whatever shape your SLI takes.

At capstone scale the absolute numbers look laughably small. A 99.5% SLO on 200,000 requests per month allows exactly 1,000 bad requests. A 99.9% SLO on 10,000 requests per day allows ten. That is not a bug. The entire point of the error budget is to make the small numbers visible so you have to decide, explicitly, what to do with them -- instead of hiding behind percentages that treat "we had one bad hour" as equivalent to "we had one bad minute."

An error budget has three roles:

  • a cap on failures you can emit in the window without breaking the SLO
  • a signal: burn rate (consumption per unit time relative to window allowance) tells you how fast you are using the budget
  • a policy: what changes when half, three-quarters, or all of the budget is gone

Without a policy, a budget is just an arithmetic identity. With one, it becomes the control loop that actually governs risk-taking. The Google SRE Workbook "error budget policy" template makes the policy concrete: it lists conditions, and for each condition a decision about releases, changes, and on-call response. That is the artifact this concept ends with.

A subtlety that bites capstones: latency and availability budgets are separate. A 1% latency budget and a 0.5% availability budget do not combine into one number -- a deploy can pass the availability check while burning latency budget, or vice versa. Treat each SLI's budget as an independent ledger with its own policy ladder.

Why It Matters Here (In the Capstone)

A capstone has low traffic. That means:

  • a single bad deploy can blow 30 days of budget in 10 minutes
  • noise is real: one flaky dependency can push you below SLO even if nothing is wrong with your code
  • budget-based alerting must be calibrated for the traffic you have, not the traffic Google has

If you do not budget, you will either alert on every blip (and learn to ignore your pager -- the failure mode concept 3 warns against) or miss the one that matters (because 0.5% of 200,000 is still a thousand unhappy users). The error budget is also what turns the on-call posture in concept 14 from "keep everything perfect" into "spend the allowance deliberately."

Concrete Example(s) -- from a real capstone

Using the same webhook-handler capstone from concept 1:

  • SLO: availability >= 99.5% over 30 days
  • Expected requests in 30 days: ≈ 200,000 (baseline 4-8 req/s)
  • Error budget in events: 0.5% × 200,000 = 1,000 allowed 5xx responses
  • Error budget in time terms (counting every minute of burn as a full outage): 0.5% of 30 days = 0.005 × 43,200 minutes = 216 minutes of total downtime allowed
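
The two budget figures above can be reproduced with a minimal sketch like the one below. The function names are hypothetical, and the traffic estimate of 200,000 requests is the one assumed in this example, not a measured value.

```python
def error_budget_events(slo: float, expected_events: int) -> int:
    """Allowed bad events in the window: (1 - SLO) x expected events."""
    return round((1 - slo) * expected_events)


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed full-outage minutes in the window: (1 - SLO) x window length."""
    return (1 - slo) * window_days * 24 * 60


# Webhook-handler example: 99.5% over 30 days, about 200,000 requests.
print(error_budget_events(0.995, 200_000))       # 1000
print(round(error_budget_minutes(0.995, 30)))    # 216
```

Computing both forms side by side makes it obvious when the event budget is small enough that a single incident matters.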

Policy ladder (library/raw/error-budget-policy.md):

| Budget consumed | What it means | Policy |
|---|---|---|
| 0-25% | Normal operation | Ship features. Deploy with standard reviews. |
| 25-50% | Budget is burning faster than expected | Review recent deploys. Keep feature work, but no risky infra changes. |
| 50-75% | Material risk to missing SLO | Freeze risky changes. Next sprint's first ticket is reliability. |
| 75-100% | Near-miss territory | Deploy only bug fixes and reliability work. Postmortem required if exhausted. |
| > 100% | SLO violated | Apply the "Consequence" from the SLO document. No new features for seven days. |

Notice the ladder: a single value ("budget used: 62%") maps to a specific set of decisions. That is the control loop.
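
One way to make that mapping mechanical is a small lookup, sketched below. The names `LADDER` and `tier_for` are hypothetical, and the boundary handling is an assumption: the ladder's ranges share their endpoints, so this sketch places a value that sits exactly on a boundary (for example 50%) in the lower tier.

```python
# Tier upper bounds (percent consumed) and policies, copied from the ladder.
LADDER = [
    (25, "0-25%", "Ship features. Deploy with standard reviews."),
    (50, "25-50%", "Review recent deploys. Keep feature work, but no risky infra changes."),
    (75, "50-75%", "Freeze risky changes. Next sprint's first ticket is reliability."),
    (100, "75-100%", "Deploy only bug fixes and reliability work. Postmortem required if exhausted."),
]
OVER = (">100%", 'Apply the "Consequence" from the SLO document. No new features for seven days.')


def tier_for(consumed_pct: float) -> tuple[str, str]:
    """Return (tier name, policy) for a budget-consumed percentage.

    Assumption: a value exactly on a boundary falls into the lower tier.
    """
    for upper, name, policy in LADDER:
        if consumed_pct <= upper:
            return name, policy
    return OVER


name, policy = tier_for(62)
print(name)    # 50-75%
print(policy)  # Freeze risky changes. Next sprint's first ticket is reliability.
```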

Worked arithmetic -- what one real incident costs:

Suppose the notification API from concept 11 has a 20-minute degradation. During the incident, the 5xx rate on /webhook is 8% (instead of the normal near-zero). Traffic during those 20 minutes was ≈ 6,000 requests. Bad events = 6,000 × 0.08 = 480. That is 48% of the monthly 1,000-event budget, from one 20-minute incident. After the incident, you are at the top of the 25-50% tier -- "keep feature work, but no risky infra changes" -- and two points short of the 50-75% freeze on risky changes. One more incident like it would take you to 96%, past the 75% line, where only bug fixes and reliability work may deploy.
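
The same arithmetic as a minimal sketch, using the figures from this incident. The function name `incident_cost` is hypothetical.

```python
def incident_cost(requests: int, error_rate: float, budget_events: int) -> tuple[float, float]:
    """Return (bad events, percent of the window budget) for one incident."""
    bad = requests * error_rate
    return bad, 100 * bad / budget_events


bad, pct = incident_cost(requests=6_000, error_rate=0.08, budget_events=1_000)
print(round(bad), round(pct))   # 480 48
print(round(2 * pct))           # 96 -> two such incidents in one window
```

Running this for a realistic worst-case incident before committing to an SLO shows how many incidents the target can absorb.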

This is why capstone error budgets feel uncomfortably small. The discipline is to accept the arithmetic instead of lying about the target.

Latency budget example (separate ledger):

  • Latency SLI: count(status<500 AND duration_ms<=300) / count(status<500) ≥ 99.0% over 30 days.
  • Budget: 1% of ≈ 200,000 good responses = 2,000 allowed slow responses.
  • A week of mildly slow tails (p95 creeping to 350ms) can eat this budget without ever triggering a user-visible outage. That is exactly why latency needs its own policy ladder, usually a notch looser than availability for a capstone.

Burn-rate math you should have memorised:

  • burn_rate = (events_bad_in_window / events_total_in_window) / (1 - SLO)
  • A burn rate of 1× over any window means "on pace to exactly deplete the 30-day budget by window end."
  • A burn rate of 14.4× means "on pace to deplete the 30-day budget in about 2 days" (30 / 14.4 ≈ 2.1). That is the fast-burn trigger in concept 3.
  • A burn rate of 6× means "on pace to deplete the 30-day budget in 5 days." That is the typical slow-burn trigger.

Put those three numbers on the dashboard next to the budget percentage, and every operator looking at the dashboard can tell -- at a glance -- whether the current blip is "noise" or "the clock is ticking."
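
The burn-rate formula and the "days to depletion" reading can be sketched as follows. The function names are hypothetical, and `days_to_exhaustion` assumes a full budget at the start and a burn rate that stays constant.

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """(bad / total) / (1 - SLO) over whatever window the counts cover."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (bad / total) / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if this burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


print(round(days_to_exhaustion(1.0), 1))    # 30.0
print(round(days_to_exhaustion(14.4), 1))   # 2.1
print(round(days_to_exhaustion(6.0), 1))    # 5.0
# An 8% error rate against a 99.5% SLO:
print(round(burn_rate(480, 6_000, 0.995), 1))   # 16.0
```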

Common Confusion / Misconceptions

"Error budget means I should break things on purpose." No. The budget is a permission slip for necessary risk: deploys, dependency upgrades, experiments. If the budget is intact at the end of the window, you were under-shipping or over-engineering, not winning -- but that is a diagnostic signal, not a mandate to cause outages.

"The budget is per day." It is per window. A rolling 30-day budget rebuilds gradually as bad events from more than 30 days ago fall off the back. This is why a single 90-minute outage today can still be inside budget yet leave you exposed for the next 30 days: until it ages out of the window, every further incident adds to it, and a second long incident can push the total past the limit.

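The rolling-window behaviour can be sketched with per-day counts, as below. The `RollingBudget` class is hypothetical, and it assumes days are recorded in non-decreasing order and that daily granularity is fine for a capstone.

```python
from collections import deque


class RollingBudget:
    """Tracks bad events per day over a rolling window."""

    def __init__(self, budget_events: int, window_days: int = 30):
        self.budget_events = budget_events
        self.window_days = window_days
        self.days = deque()  # (day_index, bad_events), oldest first

    def _expire(self, today: int) -> None:
        # Drop entries that are window_days or more old.
        while self.days and self.days[0][0] <= today - self.window_days:
            self.days.popleft()

    def record(self, day: int, bad_events: int) -> None:
        self.days.append((day, bad_events))
        self._expire(day)

    def consumed_pct(self, today: int) -> float:
        self._expire(today)
        return 100 * sum(bad for _, bad in self.days) / self.budget_events


budget = RollingBudget(budget_events=1_000)
budget.record(day=0, bad_events=480)
budget.record(day=10, bad_events=300)
print(budget.consumed_pct(today=10))   # 78.0
print(budget.consumed_pct(today=29))   # 78.0 (day 0 is still in the window)
print(budget.consumed_pct(today=30))   # 30.0 (day 0 has aged out)
```
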
"The budget only counts outages." It counts all failed events in the SLI: 5xx responses, timeouts past your latency target, stale-freshness reads, corrupted outputs -- whatever your SLI is built from. Latency SLOs and availability SLOs have separate budgets and separate ladders.

"The budget is the total of all SLI budgets." Budgets do not combine. A 1% latency budget and a 0.5% availability budget are two ledgers. Over-spending one does not entitle you to borrow from the other.

"If we're not burning budget, we're fine." Sustained 0% burn means your SLO is too loose, your product is under-used, or you are shipping too slowly. All three deserve investigation. The budget being deliberately spent -- on features, experiments, planned maintenance -- is the healthy state.

"Planned maintenance shouldn't count against the budget." If the user cannot tell the difference between "planned maintenance" and "you are down," the SLI does not distinguish them either. Either accept the burn, exempt specific windows contractually (rare for capstones), or do the maintenance under a feature flag that does not break the SLI.

How To Use It (In Your Capstone)

  1. Compute the budget in events, not just percent. Percentages lie at low traffic.
  2. Write the policy ladder (0-25, 25-50, 50-75, 75-100, >100) into library/raw/error-budget-policy.md, with different ladders for availability and latency if both exist.
  3. Decide, once, what each tier forbids and what it requires. Check this into the repo.
  4. When the budget crosses a tier, follow the policy. No exceptions, or the budget is meaningless.
  5. Wire the budget into CI as a soft gate: at 75% consumed, deploys require a comment referencing the tier; at 100%, deploys for non-reliability changes fail.
  6. At the end of each 30-day window, write one paragraph about whether the policy served you.
  7. Link the policy from the PRR checklist and from each runbook's "Impact" section.
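
Step 5 could look like the minimal sketch below. The function name, its arguments, and the way a "comment referencing the tier" is detected (a case-insensitive search for the word "tier") are all assumptions. A real CI job would fetch the budget figure and the PR text from your own tooling.

```python
def deploy_gate(consumed_pct: float, is_reliability_change: bool,
                pr_comment: str = "") -> tuple[bool, str]:
    """Soft gate: returns (allowed, reason) for a proposed deploy."""
    if consumed_pct >= 100 and not is_reliability_change:
        return False, "budget exhausted: only reliability changes may deploy"
    # Assumption: any comment mentioning "tier" counts as referencing the tier.
    if consumed_pct >= 75 and "tier" not in pr_comment.lower():
        return False, "budget >= 75% consumed: reference the current tier in the PR"
    return True, "ok"


print(deploy_gate(40, False))
print(deploy_gate(80, False))
print(deploy_gate(80, False, "Tier 75-100%: bug fix only"))
print(deploy_gate(105, False, "Tier >100%"))
```

In a CI script, a False result would translate into a non-zero exit code so the pipeline stops.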

Anti-pattern you should refuse to ship: a "budget health indicator" that only goes red at 100%. That is a binary crisis signal, not a control loop. The point of the ladder is that tier transitions are actionable -- 49% -> 51% is the moment you start declining risky deploys, not the moment the pager rings.

Check Yourself

  1. A 99.9% SLO on a service handling 10,000 requests per day: what is the weekly error budget in failed requests?
  2. Why is "the budget is fine" a worse answer than "the budget is 31% consumed, last refresh on [timestamp]"?
  3. Why does a policy ladder protect you more than a single "budget exhausted" alert?
  4. If your latency budget is at 90% but availability is at 10%, which of the two ladders applies to a proposed risky deploy, and why?
  5. Why does a rolling 30-day window behave differently from a fixed calendar month near month boundaries?
  6. If a single 20-minute incident consumes 48% of the monthly budget, which tier does the system enter, and what is forbidden in that tier for the rest of the window?
  7. Describe, in one sentence, why a latency budget and an availability budget are independent ledgers rather than a single composite number.

Mini Drill or Application (Capstone-scoped)

  1. For your capstone's SLO: compute the error budget in events for one 30-day window using a reasonable estimate of traffic.
  2. Write the 5-tier policy ladder as a markdown table in library/raw/error-budget-policy.md.
  3. For each tier, write one sentence that starts with "no" (what is forbidden) and one that starts with "must" (what is required).
  4. Estimate how much one bad deploy would consume (3-5 minutes of 5xx at peak). If it is more than 20% of the monthly budget, flag the target as too tight and either loosen it or harden deploys before freezing it.
  5. Wire the policy into your deploy workflow: at minimum, a comment in the PR description asking "which tier are we in?" with the expected answer reviewed at merge time.
  6. Simulate a tier change: force the budget-consumed number to 80% in a test query and walk through what the ladder demands. Iterate until the ladder is specific enough that you could not bluff your way past it at 2 a.m.
  7. Keep a short monthly log of budget-end-state: tier reached, top two contributors, and what you did about them. Three windows of history is usually enough to see whether the SLO is calibrated or aspirational.
  8. If the ladder has been crossed more than once in 30 days without the forbid/require clauses actually kicking in, the ladder is a lie -- rewrite it or tighten enforcement before adding a new feature.
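
Drill step 4 can be checked with a minimal sketch like this one. The function names are hypothetical, and the peak rate of 1 request per second is purely illustrative. Substitute your own peak traffic and budget.

```python
def bad_deploy_share(peak_rps: float, minutes: float,
                     failure_rate: float, budget_events: int) -> float:
    """Percent of the window budget consumed by one bad deploy."""
    bad = peak_rps * 60 * minutes * failure_rate
    return 100 * bad / budget_events


def target_too_tight(share_pct: float, limit_pct: float = 20) -> bool:
    """Flag the SLO target when one bad deploy exceeds the 20% guideline."""
    return share_pct > limit_pct


# Illustrative inputs: 1 req/s at peak, every request failing, 1,000-event budget.
for minutes in (3, 5):
    share = bad_deploy_share(peak_rps=1, minutes=minutes,
                             failure_rate=1.0, budget_events=1_000)
    print(minutes, round(share), target_too_tight(share))
# 3 18 False
# 5 30 True
```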

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.