Alert on the SLO, Not Everything
What This Concept Is
Alerting well is not "detect everything that could go wrong." It is "wake a human only when the answer is a decision a human has to make right now."
The cleanest way to get there is to derive alerts from the SLO and its error budget, rather than from individual low-level metrics. The SRE Workbook calls this the multi-window, multi-burn-rate pattern: pick two or three time windows, compare the burn rate against the monthly allowance at each window, and only fire when two windows agree. Two alerts cover most of a small service:
- a fast burn alert: "we are burning budget so quickly that we will exhaust it in hours if this continues"
- a slow burn alert: "we are burning budget over days, slowly, and the window will turn red soon"
Everything else belongs in a dashboard, in a report, or in a ticket queue -- not in your pager. An alert is a contract: "if this fires, the on-call operator will stop what they are doing and act." If that is not true for an alert, delete it.
The math under the hood: a burn-rate threshold of k means "at this rate, you will consume 100% of the budget in window / k." Thus a 14.4x burn on a 30-day budget consumes it in just over 2 days; a 6x burn consumes it in 5 days; a 1x burn is on track to deplete the budget exactly at the end of the window. The standard thresholds (14.4, 6, 1) are not magic -- they correspond to "consume 2% of budget in 1h," "5% in 6h," and "10% in 3d" respectively, which are the starting points the SRE Workbook recommends.
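The arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the function names are illustrative (they come from no library), and it assumes the 30-day SLO window used throughout this concept.

```python
SLO_WINDOW_HOURS = 30 * 24  # 30-day SLO window, as assumed in this concept


def days_to_exhaustion(burn_rate: float, slo_window_days: float = 30.0) -> float:
    """Days until the whole budget is gone if burn_rate is sustained."""
    return slo_window_days / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the budget consumed when burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / SLO_WINDOW_HOURS


# (burn rate, alert window in hours) pairs from the paragraph above
for burn, hours in [(14.4, 1), (6, 6), (1, 72)]:
    print(f"{burn}x: budget gone in {days_to_exhaustion(burn):.2f} days, "
          f"{budget_consumed(burn, hours):.0%} of budget consumed in {hours}h")
```

Both functions are the same relation read two ways: burn rate times elapsed time, divided by the SLO window, is the fraction of budget spent.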
This concept does not replace the dashboards in Cluster 2 or the runbooks in Cluster 5; it decides which of the dashboard's signals are urgent enough to page.
Why It Matters Here (In the Capstone)
In a capstone with one operator (you), every false page is more expensive than the problem it reports. The failure mode is not "we missed an incident"; it is "we paged for noise, trained ourselves to ignore the pager, and then missed the real one."
Worse: without SLO-linked alerting, the pressure to add more alerts never stops. Every incident becomes "we should have been alerted on this specific symptom" -- and the alerts grow until nobody looks at them. SLO burn-rate alerts short-circuit that spiral by tying every page to a user-visible consequence defined in concept 1.
The on-call posture in concept 14 caps page volume at "<= 2 per week." Burn-rate alerts are how you make that cap real. If every paging alert is SLO-linked, the volume stays bounded to the actual user-visible incidents your system produces.
Concrete Example(s) -- from a real capstone
Using the capstone from concepts 1-2 (availability SLO 99.5% over 30 days, budget = 0.5% of requests):
Fast-burn alert (Prometheus-style recording + alerting rule):
    # Recording rule: error ratios over 5m and 1h windows
    - record: job:webhook_errors:ratio_rate5m
      expr: |
        sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[5m]))
        /
        sum(rate(http_requests_total{job="api", route="/webhook"}[5m]))
    - record: job:webhook_errors:ratio_rate1h
      expr: |
        sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[1h]))
        /
        sum(rate(http_requests_total{job="api", route="/webhook"}[1h]))
    # Alerting rule: fast-burn (14.4x over 5m AND 1h)
    - alert: WebhookSLOFastBurn
      expr: |
        job:webhook_errors:ratio_rate5m > (14.4 * 0.005)
        and
        job:webhook_errors:ratio_rate1h > (14.4 * 0.005)
      for: 2m
      labels: { severity: page }
      annotations:
        summary: "Webhook SLO fast-burn (>7.2% errors over 5m+1h)"
        runbook: "library/raw/runbooks/webhook-notify-outage.md"
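Slow-burn alert (6x over 6h AND 24h), with the recording rules it reads:
    - record: job:webhook_errors:ratio_rate6h
      expr: |
        sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[6h]))
        /
        sum(rate(http_requests_total{job="api", route="/webhook"}[6h]))
    - record: job:webhook_errors:ratio_rate24h
      expr: |
        sum(rate(http_requests_total{job="api", status=~"5..", route="/webhook"}[24h]))
        /
        sum(rate(http_requests_total{job="api", route="/webhook"}[24h]))
    - alert: WebhookSLOSlowBurn
      expr: |
        job:webhook_errors:ratio_rate6h > (6 * 0.005)
        and
        job:webhook_errors:ratio_rate24h > (6 * 0.005)
      for: 15m
      labels: { severity: ticket }
      annotations:
        summary: "Webhook SLO slow-burn (>3% errors over 6h+24h)"
        runbook: "library/raw/runbooks/webhook-notify-outage.md"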
What each means in plain English:
| Alert | Threshold | Budget consumed if sustained | Action |
|---|---|---|---|
| Fast-burn | 14.4x for 5m AND 1h | approximately 2% of 30-day budget in 1h | PAGE on-call |
| Slow-burn | 6x for 6h AND 24h | approximately 5% of budget in 6h | TICKET, look by business day |
The burn rates and budget percentages come straight from the multi-window, multi-burn-rate pattern in the SRE Workbook chapter "Alerting on SLOs." The two-window AND gate is what filters out one-minute spikes that would otherwise be noise at low capstone traffic.
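To make the AND gate concrete, here is a small Python sketch of the same comparison the alerting rules perform. The function name and the sample ratios are hypothetical; only the 0.5% budget and the 14.4x burn rate come from the capstone example.

```python
ERROR_BUDGET = 0.005  # 99.5% availability SLO -> 0.5% of requests may fail


def burn_alert_fires(short_ratio: float, long_ratio: float, burn_rate: float,
                     budget: float = ERROR_BUDGET) -> bool:
    """True only when BOTH windows exceed burn_rate * budget."""
    threshold = burn_rate * budget
    return short_ratio > threshold and long_ratio > threshold


# Brief spike: the 5m window is hot, the 1h window is not -> no page.
spike = burn_alert_fires(short_ratio=0.20, long_ratio=0.01, burn_rate=14.4)
# Sustained outage: both windows are hot -> page.
outage = burn_alert_fires(short_ratio=0.20, long_ratio=0.09, burn_rate=14.4)
# Recovered incident: 1h window still hot, 5m window clean -> alert resolves.
recovered = burn_alert_fires(short_ratio=0.0, long_ratio=0.09, burn_rate=14.4)
print(spike, outage, recovered)
```

The long window keeps short spikes from paging; the short window makes the alert stop firing soon after the errors stop, instead of lingering until the long window drains.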
What you do NOT alert on:
- CPU > 80%
- memory > 70%
- disk used > 60%
- queue depth > threshold (unless it directly breaches the SLI)
- any "warning" level metric that does not imply a user impact
All of those belong on the dashboard from concept 5. If CPU is high and the SLO is fine, the system is doing its job.
Common Confusion / Misconceptions
"If we don't alert on CPU we will miss a runaway." A runaway CPU that does not affect the SLI is not a user-visible problem; it is a capacity signal. Put it on a dashboard, review it weekly, and capacity-plan. If it does affect latency or errors, the SLO alert will catch it -- and it will catch it for the right reason.
"The low-traffic problem: a single failed request is 10% of our rate." Correct, and this is why multi-window alerts matter. A 1-minute window at low traffic is nearly useless; require the burn to persist over 5 minutes and an hour before paging. If you have fewer than approximately 10 requests per minute, do not alert on short windows at all -- alert only on the slow burn.
"We need an alert for every thing that could break." No. You need an alert for every user-visible thing that would make you want to be paged. Everything else is a trace, log, metric, or ticket.
"We'll tune thresholds later." Alerts never get tuned by accident; they only get ignored. Tune at creation time: derive thresholds from the budget, simulate one failure mode, and commit the rule with its runbook in the same PR.
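The low-traffic point above is easy to see with numbers. The sketch below assumes steady traffic of 10 requests per minute (a hypothetical figure) and a single failed request, then asks what error ratio each window would report against the 7.2% fast-burn threshold.

```python
FAST_BURN_THRESHOLD = 14.4 * 0.005  # 7.2% error ratio, from the fast-burn rule


def error_ratio(failed: int, requests_per_minute: float, window_minutes: float) -> float:
    """Error ratio a window reports, assuming steady traffic."""
    total = requests_per_minute * window_minutes
    return failed / total if total else 0.0


# One failed request at 10 requests/minute (hypothetical traffic level).
for window in (1, 5, 60):
    ratio = error_ratio(failed=1, requests_per_minute=10, window_minutes=window)
    print(f"{window:>2}m window: {ratio:.2%} hot={ratio > FAST_BURN_THRESHOLD}")
```

Under these assumptions only the 1-minute window crosses the threshold, which is why a single failure should never page on its own.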
How To Use It (In Your Capstone)
- Start from the SLO. Write burn rates over two windows (e.g., 5m/1h and 6h/24h).
- Multiply the burn rate by the error-budget fraction to get the error-rate threshold.
- Wire the alert to fire only when both windows agree (AND, not OR).
- Page only for the fast-burn alert. Ticket for the slow-burn alert.
- Each alert annotation must include a runbook link; alerts without runbooks fail the PRR.
- Delete every alert that is not tied to a user symptom. Track deletions; it is the healthiest alert-hygiene metric there is.
- Simulate once per quarter: inject synthetic 5xx for 5 minutes in staging and verify the fast-burn alert fires, then recovers cleanly.
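The first few steps above (SLO, burn rate, threshold, severity) can be sketched as a small table-building script. The `BurnAlert` class is a hypothetical helper for illustration; the alert names, windows, and burn rates are the ones used in the capstone example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    burn_rate: float
    short_window: str
    long_window: str
    severity: str  # "page" or "ticket"

    def threshold(self, slo_target: float) -> float:
        """Error-ratio threshold = burn rate x error-budget fraction."""
        return self.burn_rate * (1.0 - slo_target)


ALERTS = [
    BurnAlert("WebhookSLOFastBurn", 14.4, "5m", "1h", "page"),
    BurnAlert("WebhookSLOSlowBurn", 6.0, "6h", "24h", "ticket"),
]

for alert in ALERTS:
    t = alert.threshold(slo_target=0.995)
    print(f"{alert.name}: ratio > {t:.3f} over {alert.short_window} AND "
          f"{alert.long_window} -> {alert.severity}")
```

Keeping the SLO target as the only input means a change to the SLO changes every threshold at once, rather than leaving hand-edited numbers scattered across rule files.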
See also (integrative)
- S8 M04 -- symptom-based alerting and page fatigue, the origin of the "alert on the SLO" discipline:
  ../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-05-incident-and-observability/14-incident-lifecycle-detect-triage-mitigate-resolve-review-primary.md
- S9 M05 Cluster 5 -- alerting on symptoms not causes: the cloud-operations version of this rule:
  ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-05-operating-under-observation/14-alerting-on-symptoms-not-causes-primary.md
- S9 M05 Cluster 4 -- metrics cardinality / RED / USE: cardinality budgets constrain which burn-rate queries actually scale:
  ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-04-observability-pillars-in-cloud/10-metrics-cardinality-exemplars-use-red-primary.md
- S6 M05 -- asynchrony and "impossibility of distinguishing slow from dead" explains why short-window alerts are structurally unreliable:
  ../../../../semester-06-databases-distributed/module-05-distributed-systems-fundamentals/concepts/cluster-01-the-inescapable-reality/03-asynchrony-and-the-impossibility-of-distinguishing-slow-from-dead-primary.md
- Google SRE Workbook -- Alerting on SLOs -- the multi-window, multi-burn-rate pattern used in this concept.
- Google SRE Book -- Practical Alerting -- alert hygiene principles, including the "do we need to wake a human?" test.
- Prometheus docs -- Alerting rules -- the for: clause and templating that this concept's YAML depends on.
- Prometheus docs -- Recording rules and rule naming -- why aggregating numerator and denominator separately for burn-rate ratios matters.
Check Yourself
- Why does a single-window "error rate > 1%" alert behave poorly at capstone traffic?
- Why does the fast-burn alert require both the 5m window AND the 1h window to be hot?
- Name one metric you currently alert on that does not describe a user-visible symptom, and write the ticket to delete it.
- What does a 14.4x burn rate mean in plain English about how quickly the 30-day budget would be exhausted?
- Why must every paging alert carry a runbook link as an annotation?
Mini Drill or Application (Capstone-scoped)
- For your capstone SLO: write both burn-rate alerts in Prometheus / Alertmanager / CloudWatch syntax. Commit them to alerts/.
- Silence or delete every non-SLO pager alert. Leave the metrics on a dashboard and record the deletions in a short CHANGELOG.
- Simulate: force 5 minutes of injected 5xx on a staging environment and confirm the fast-burn alert fires, then recovers. Capture a screenshot.
- Write one sentence in the runbook (Cluster 5) for each alert: "if this fires, the first thing I check is ___."
- Once per quarter, recount your paging alerts. If the list has grown without a new SLO, treat it as a hygiene regression and prune.
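Before running the injection drill, it can help to predict what the fast-burn rule should do. The sketch below is a rough model, not a Prometheus emulator: it assumes steady traffic, a clean hour before the injection, and one rule evaluation at the end of each minute. The function name `hot_minutes` is hypothetical.

```python
THRESHOLD = 14.4 * 0.005  # fast-burn error-ratio threshold (7.2%)


def hot_minutes(injected_minutes: int, error_fraction: float = 1.0,
                horizon: int = 30) -> list:
    """Minutes (counted from injection start) at which both windows exceed THRESHOLD.

    Assumes steady traffic, a clean hour before the injection, and one rule
    evaluation at the end of each minute.
    """
    # Per-minute error fraction: 60 clean minutes, the injection, then clean again.
    history = [0.0] * 60 + [error_fraction] * injected_minutes
    history += [0.0] * (horizon - injected_minutes)
    hot = []
    for end in range(60, len(history)):
        ratio_5m = sum(history[end - 4:end + 1]) / 5
        ratio_1h = sum(history[end - 59:end + 1]) / 60
        if ratio_5m > THRESHOLD and ratio_1h > THRESHOLD:
            hot.append(end - 60 + 1)  # 1-based minute since injection began
    return hot


print(hot_minutes(5))                      # full outage for 5 minutes
print(hot_minutes(4))                      # full outage for 4 minutes
print(hot_minutes(5, error_fraction=0.5))  # half of requests failing for 5 minutes
```

In this model, a 5-minute injection that fails every request keeps both windows hot from minute 5 through minute 9, long enough to satisfy for: 2m and then clear. A 4-minute or half-strength injection never pushes the 1h window past 7.2%, so the drill only exercises the page if the injected error rate is close to 100% of traffic.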
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.