Alert on the SLO, Not Everything

What This Concept Is

Alerting well is not "detect everything that could go wrong." It is "wake a human only when the answer is a decision a human has to make right now."

The cleanest way to get there is to derive alerts from the SLO and its error budget, rather than from individual low-level metrics. The SRE Workbook calls this the multi-window, multi-burn-rate pattern: pick a few pairs of time windows, compute the burn rate (the observed error rate divided by the error rate the SLO allows) over each window, and fire only when both windows in a pair exceed the threshold. Two alerts cover most of a small service:

  • a fast burn alert: "we are burning budget so quickly that we will exhaust it in hours if this continues"
  • a slow burn alert: "we are burning budget over days, slowly, and the window will turn red soon"

Everything else belongs in a dashboard, in a report, or in a ticket queue -- not in your pager. An alert is a contract: "if this fires, the on-call operator will stop what they are doing and act." If that is not true for an alert, delete it.

The math under the hood: a burn-rate threshold of k means "at this rate, you will consume 100% of the budget in window / k." Thus a 14.4x burn on a 30-day budget consumes it in just over 2 days; a 6x burn consumes it in 5 days; a 1x burn is on track to exactly deplete the budget by window end. The standard thresholds (14.4, 6, 3, 1) are not magic -- they correspond to "consume 2% of budget in 1h," "5% in 6h," "10% in 1d," and "10% in 3d" respectively, which are the tradeoff points the SRE Workbook recommends as starting values.
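The arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming the 30-day window used in this capstone; the function names are illustrative and do not come from any library.

```python
WINDOW_DAYS = 30  # SLO window assumed throughout this capstone


def days_to_exhaustion(burn_rate: float, window_days: float = WINDOW_DAYS) -> float:
    """Days until the whole budget is gone if this burn rate is sustained."""
    return window_days / burn_rate


def budget_fraction_consumed(burn_rate: float, hours: float,
                             window_days: float = WINDOW_DAYS) -> float:
    """Fraction of the window's budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)


# (burn rate, alert window in hours) for the standard thresholds
STANDARD = [(14.4, 1), (6, 6), (3, 24), (1, 72)]

for rate, hours in STANDARD:
    print(f"{rate}x: budget gone in {days_to_exhaustion(rate):.2f} days, "
          f"{budget_fraction_consumed(rate, hours):.0%} consumed in {hours}h")
```

Running it shows why 14.4 looks arbitrary but is not: it is simply the rate at which 2% of a 720-hour budget disappears in one hour.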

This concept does not replace the dashboards in Cluster 2 or the runbooks in Cluster 5; it decides what of the dashboard's information is urgent enough to page.

Why It Matters Here (In the Capstone)

In a capstone with one operator (you), every false page is more expensive than the problem it reports. The failure mode is not "we missed an incident"; it is "we paged for noise, trained ourselves to ignore the pager, and then missed the real one."

Worse: without SLO-linked alerting, the pressure to add more alerts never stops. Every incident becomes "we should have been alerted on this specific symptom" -- and the alerts grow until nobody looks at them. SLO burn-rate alerts short-circuit that spiral by tying every page to a user-visible consequence defined in concept 1.

The on-call posture in concept 14 caps page volume at "<= 2 per week." Burn-rate alerts are how you make that cap real. If every paging alert is SLO-linked, the volume stays bounded to the actual user-visible incidents your system produces.

Concrete Example(s) -- from a real capstone

Using the capstone from concepts 1-2 (availability SLO 99.5% over 30 days, budget = 0.5% of requests):

Fast-burn alert (Prometheus-style recording + alerting rule):

# Recording rule: error ratios over 5m and 1h windows
- record: job:webhook_errors:ratio_rate5m
  expr: |
    sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[5m]))
    /
    sum(rate(http_requests_total{job="api", route="/webhook"}[5m]))

- record: job:webhook_errors:ratio_rate1h
  expr: |
    sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[1h]))
    /
    sum(rate(http_requests_total{job="api", route="/webhook"}[1h]))

# Alerting rule: fast-burn (14.4x over 5m AND 1h)
- alert: WebhookSLOFastBurn
  expr: |
    job:webhook_errors:ratio_rate5m > (14.4 * 0.005)
    and
    job:webhook_errors:ratio_rate1h > (14.4 * 0.005)
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "Webhook SLO fast-burn (>7.2% errors over 5m+1h)"
    runbook: "library/raw/runbooks/webhook-notify-outage.md"

Slow-burn alert (6x over 6h/24h):

# Assumes recording rules job:webhook_errors:ratio_rate6h and
# job:webhook_errors:ratio_rate24h, defined like the 5m/1h rules above.
- alert: WebhookSLOSlowBurn
  expr: |
    job:webhook_errors:ratio_rate6h > (6 * 0.005)
    and
    job:webhook_errors:ratio_rate24h > (6 * 0.005)
  for: 15m
  labels: { severity: ticket }
  annotations:
    summary: "Webhook SLO slow-burn (>3% errors over 6h+24h)"
    runbook: "library/raw/runbooks/webhook-notify-outage.md"

What each means in plain English:

Alert     | Threshold           | Budget consumed if sustained            | Action
Fast-burn | 14.4x for 5m AND 1h | approximately 2% of 30-day budget in 1h | PAGE on-call
Slow-burn | 6x for 6h AND 24h   | approximately 5% of budget in 6h        | TICKET, look by business day

These burn rates come from the multi-window, multi-burn-rate pattern in the "Alerting on SLOs" chapter of the SRE Workbook. The two-window AND gate is what filters one-minute spikes that would otherwise be noise at low capstone traffic.
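To see the AND gate at work, here is a small sketch that applies the fast-burn condition to per-minute request counts. It is an illustration only: the traffic figures (20 requests per minute) are invented, the helper names are hypothetical, and the `for: 2m` hold time of the real rule is not modeled.

```python
from typing import Sequence, Tuple

Minute = Tuple[int, int]  # (total requests, failed requests) in one minute


def error_ratio(minutes: Sequence[Minute], window: int) -> float:
    """Error ratio over the trailing `window` minutes (0.0 if no traffic)."""
    recent = minutes[-window:]
    total = sum(t for t, _ in recent)
    failed = sum(f for _, f in recent)
    return failed / total if total else 0.0


def fast_burn_fires(minutes: Sequence[Minute],
                    burn_rate: float = 14.4, budget: float = 0.005) -> bool:
    """True only when BOTH the 5m and the 1h window exceed burn_rate * budget."""
    threshold = burn_rate * budget
    return (error_ratio(minutes, 5) > threshold
            and error_ratio(minutes, 60) > threshold)


# One bad minute at the end of an otherwise clean hour (invented data).
spike = [(20, 0)] * 59 + [(20, 20)]
# Ten minutes of 50% errors at the end of the hour (invented data).
outage = [(20, 0)] * 50 + [(20, 10)] * 10

print("spike fires: ", fast_burn_fires(spike))
print("outage fires:", fast_burn_fires(outage))
```

The spike pushes the 5m ratio to 20% but leaves the 1h ratio under 2%, so the gate stays closed; the sustained outage lifts both windows past 7.2%.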

What you do NOT alert on:

  • CPU > 80%
  • memory > 70%
  • disk used > 60%
  • queue depth > threshold (unless it directly breaches the SLI)
  • any "warning" level metric that does not imply a user impact

All of those belong on the dashboard from concept 5. If CPU is high and the SLO is fine, the system is doing its job.

Common Confusion / Misconceptions

"If we don't alert on CPU we will miss a runaway." A runaway CPU that does not affect the SLI is not a user-visible problem; it is a capacity signal. Put it on a dashboard, review it weekly, and capacity-plan. If it does affect latency or errors, the SLO alert will catch it -- and it will catch it for the right reason.

"The low-traffic problem: a single failed request is 10% of our rate." Correct, and this is why multi-window alerts matter. A 1-minute window at low traffic is nearly useless; require the burn to persist over 5 minutes and an hour before paging. If you have fewer than approximately 10 requests per minute, do not alert on short windows at all -- alert only on the slow burn.
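A rough sketch makes the low-traffic point concrete. It assumes a steady 10 requests per minute and the 7.2% fast-burn threshold from the example rules; both helper functions are hypothetical names introduced here for illustration.

```python
import math


def ratio_from_one_failure(requests_per_minute: int, window_minutes: int) -> float:
    """Error ratio that a single failed request produces in the window."""
    return 1 / (requests_per_minute * window_minutes)


def failures_needed(requests_per_minute: int, window_minutes: int,
                    threshold: float) -> int:
    """Smallest number of failures that pushes the ratio strictly above threshold."""
    total = requests_per_minute * window_minutes
    return math.floor(threshold * total) + 1


THRESHOLD = 14.4 * 0.005  # fast-burn threshold from the example rules
RPM = 10                  # assumed traffic level

for window in (1, 5, 60):
    print(f"{window:>2}m window: one failure = "
          f"{ratio_from_one_failure(RPM, window):.2%}, "
          f"failures needed to cross threshold = "
          f"{failures_needed(RPM, window, THRESHOLD)}")
```

At this traffic level a single failure is enough to cross the threshold in a 1-minute window, while the 5-minute and 1-hour windows need several failures, which is the behavior the multi-window rule relies on.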

"We need an alert for every thing that could break." No. You need an alert for every user-visible thing that would make you want to be paged. Everything else is a trace, log, metric, or ticket.

"We'll tune thresholds later." Alerts never get tuned by accident; they only get ignored. Tune at creation time: derive thresholds from the budget, simulate one failure mode, and commit the rule with its runbook in the same PR.

How To Use It (In Your Capstone)

  1. Start from the SLO. Write a burn-rate condition for each of two window pairs (e.g., 5m/1h and 6h/24h).
  2. Multiply the burn rate by the error-budget fraction to get the error-rate threshold.
  3. Wire the alert to fire only when both windows agree (AND, not OR).
  4. Page only for the fast-burn alert. Ticket for the slow-burn alert.
  5. Each alert annotation must include a runbook link; alerts without runbooks fail the PRR.
  6. Delete every alert that is not tied to a user symptom. Track deletions; it is the healthiest alert-hygiene metric there is.
  7. Simulate once per quarter: inject synthetic 5xx for 5 minutes in staging and verify the fast-burn alert fires, then recovers cleanly.
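Steps 4 and 5 are easy to enforce mechanically. The sketch below is one possible pre-merge check, assuming the alert rules have already been loaded into Python dictionaries (for example by a YAML parser, which is not shown). The function name and the HighCPU rule are hypothetical; the `page` and `ticket` severities mirror the example rules.

```python
from typing import Dict, List

ALLOWED_SEVERITIES = ("page", "ticket")  # severities used in the example rules


def lint_alert_rules(rules: List[Dict]) -> List[str]:
    """Return one message per problem found; an empty list means the rules pass."""
    problems = []
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules have no runbook or severity
        name = rule["alert"]
        runbook = rule.get("annotations", {}).get("runbook", "")
        if not runbook.strip():
            problems.append(f"{name}: missing runbook annotation")
        severity = rule.get("labels", {}).get("severity")
        if severity not in ALLOWED_SEVERITIES:
            problems.append(f"{name}: unexpected severity {severity!r}")
    return problems


rules = [
    {"record": "job:webhook_errors:ratio_rate5m", "expr": "..."},
    {"alert": "WebhookSLOFastBurn",
     "labels": {"severity": "page"},
     "annotations": {"runbook": "library/raw/runbooks/webhook-notify-outage.md"}},
    # Hypothetical leftover alert of the kind step 6 asks you to delete.
    {"alert": "HighCPU", "labels": {"severity": "warning"}, "annotations": {}},
]

for problem in lint_alert_rules(rules):
    print(problem)
```

Run as part of the PR checks, a non-empty result would block the merge, so a rule cannot land without its runbook.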

Check Yourself

  1. Why does a single-window "error rate > 1%" alert behave poorly at capstone traffic?
  2. Why does the fast-burn alert require both the 5m window AND the 1h window to be hot?
  3. Name one metric you currently alert on that does not describe a user-visible symptom, and write the ticket to delete it.
  4. What does a 14.4x burn rate mean in plain English about how quickly the 30-day budget would be exhausted?
  5. Why must every paging alert carry a runbook link as an annotation?

Mini Drill or Application (Capstone-scoped)

  1. For your capstone SLO: write both burn-rate alerts in Prometheus / Alertmanager / CloudWatch syntax. Commit them to alerts/.
  2. Silence or delete every non-SLO pager alert. Leave the metrics on a dashboard and record the deletions in a short CHANGELOG.
  3. Simulate: force 5 minutes of injected 5xx on a staging environment and confirm the fast-burn alert fires, then recovers. Capture a screenshot.
  4. Write one sentence in the runbook (Cluster 5) for each alert: "if this fires, the first thing I check is ___."
  5. Once per quarter, recount your paging alerts. If the list has grown without a new SLO, treat it as a hygiene regression and prune.

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.