SLO and Alert Lab
Active use only. At the end of this lab you should have one SLO document and at least one working burn-rate alert in your capstone, not a theoretical discussion of either.
Retrieval Prompts
- State the difference between an SLI, an SLO, and an SLA in your own words.
- Write the formula for an availability SLI as a ratio of events.
- For a 99.5% SLO over 30 days, what is the error budget as a percentage of total events?
- State the two windows typically combined in a multi-window burn-rate fast-burn alert.
- Why is "CPU > 80%" usually a bad primary alert?
Compare and Distinguish
Separate these pairs clearly:
- SLI vs a "system metric" like CPU utilization
- percent-based SLO vs event-based error budget
- single-window alert vs multi-window burn-rate alert
- page-worthy symptom vs ticket-worthy symptom
- aspirational SLO vs defensible SLO for current architecture
Common Mistake Check
For each statement, identify the error:
- "Our SLO is 100% -- anything less and users complain."
- "We alert if error rate exceeds 1%; that's our SLO alert."
- "Budget is fine; we're at 78% consumed with 3 days left in the window."
- "We used the AmazonFreeTierMetrics default for our SLO target."
- "The alert fires on CPU > 90% because that's when things get slow."
Mini Application
For your capstone:
- SLI formula (write it as a ratio):
  SLI = <good events expression> / <total events expression>
- SLO + window + error budget (fill in concrete numbers):
  - target: ____ %
  - window: rolling ____ days
  - error budget (% of events): ____
  - error budget (absolute events, at current traffic): ____
- Consequence for missing the SLO (one sentence):
- Fast-burn alert (pseudocode):
  condition: error_ratio(last ___) > <multiplier> * <budget fraction>
         AND error_ratio(last ___) > <multiplier> * <budget fraction>
  action: PAGE
- Slow-burn alert (pseudocode): same structure, looser thresholds, ticket not page.
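If it helps to sanity-check your numbers before wiring them into a monitoring tool, the template above can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the names `Slo`, `sli`, and `fast_burn_should_page` are hypothetical, and the 99.9% target, 30-day window, and 14.4 multiplier are placeholder values (14.4 is a commonly cited fast-burn figure), not answers for your capstone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical container for the SLO fields in the template."""
    target: float      # e.g. 0.999 for a 99.9% SLO
    window_days: int   # rolling window length in days

    @property
    def budget_fraction(self) -> float:
        """Fraction of events allowed to be bad: 1 - target."""
        return 1.0 - self.target

    def budget_events(self, total_events: int) -> float:
        """Absolute number of bad events allowed at the given traffic."""
        return self.budget_fraction * total_events


def sli(good_events: int, total_events: int) -> float:
    """SLI as a ratio of good events to total events."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as meeting the SLI.
        return 1.0
    return good_events / total_events


def fast_burn_should_page(slo: Slo,
                          long_window_error_ratio: float,
                          short_window_error_ratio: float,
                          multiplier: float) -> bool:
    """Multi-window condition: both windows must exceed the same threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    threshold = multiplier * slo.budget_fraction
    return (long_window_error_ratio > threshold
            and short_window_error_ratio > threshold)


# Placeholder numbers for illustration only.
example = Slo(target=0.999, window_days=30)
print(round(example.budget_events(10_000_000)))           # allowed bad events
print(fast_burn_should_page(example, 0.02, 0.03, 14.4))   # both windows hot
print(fast_burn_should_page(example, 0.02, 0.001, 14.4))  # short window recovered
```

A slow-burn check would reuse `fast_burn_should_page` with longer windows and a smaller multiplier, and route the result to a ticket instead of a page.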
Evidence Check
This lab is complete only if:
- library/raw/slo.md exists and contains the SLI, SLO, window, budget, and consequence
- library/raw/error-budget-policy.md contains the 5-tier ladder
- at least one burn-rate alert is configured in your monitoring tool
- you have deleted or demoted to ticket status at least one non-SLO page-level alert
- you can explain every threshold number in both alerts without checking a book
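One way to rehearse the last item is to derive each multiplier from first principles instead of memorizing it. The sketch below assumes the usual definition of burn rate (a rate of 1 spends exactly the whole budget over the SLO window). The function names are hypothetical, and the inputs (2% of budget in 1 hour, 5% in 6 hours, 30-day window) are illustrative figures often used in published burn-rate examples, not required values.

```python
def burn_rate_multiplier(budget_fraction_consumed: float,
                         slo_window_hours: float,
                         alert_window_hours: float) -> float:
    """Burn rate at which the given share of the total error budget
    is spent within the alert window."""
    return budget_fraction_consumed * slo_window_hours / alert_window_hours


def hours_to_exhaustion(slo_window_hours: float, burn_rate: float) -> float:
    """Time until a full budget is gone at a constant burn rate."""
    return slo_window_hours / burn_rate


SLO_WINDOW_HOURS = 30 * 24  # assumed 30-day rolling window

# Illustrative fast-burn input: 2% of the budget spent in 1 hour.
fast = burn_rate_multiplier(0.02, SLO_WINDOW_HOURS, 1)
# Illustrative slow-burn input: 5% of the budget spent in 6 hours.
slow = burn_rate_multiplier(0.05, SLO_WINDOW_HOURS, 6)

print(round(fast, 1), round(hours_to_exhaustion(SLO_WINDOW_HOURS, fast), 1))
print(round(slow, 1), round(hours_to_exhaustion(SLO_WINDOW_HOURS, slow), 1))
```

If you can state which budget share and which window produced each multiplier in your own alerts, you can explain the thresholds without a reference.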