Skip to main content

SLO and Alert Lab

Active use only. At the end of this lab you should have one SLO document and at least one working burn-rate alert in your capstone, not a theoretical discussion of either.

Retrieval Prompts

  1. State the difference between an SLI, an SLO, and an SLA in your own words.
  2. Write the formula for an availability SLI as a ratio of events.
  3. For a 99.5% SLO over 30 days, how much is the error budget as a percentage of total events?
  4. State the two windows typically combined in a multi-window burn-rate fast-burn alert.
  5. Why is "CPU > 80%" usually a bad primary alert?

Compare and Distinguish

Separate these pairs clearly:

  • SLI vs a "system metric" like CPU utilization
  • percent-based SLO vs event-based error budget
  • single-window alert vs multi-window burn-rate alert
  • page-worthy symptom vs ticket-worthy symptom
  • aspirational SLO vs defensible SLO for current architecture

Common Mistake Check

For each statement, identify the error:

  1. "Our SLO is 100% -- anything less and users complain."
  2. "We alert if error rate exceeds 1%; that's our SLO alert."
  3. "Budget is fine; we're at 78% consumed with 3 days left in the window."
  4. "We used the AmazonFreeTierMetrics default for our SLO target."
  5. "The alert fires on CPU > 90% because that's when things get slow."

Mini Application

For your capstone:

  1. SLI formula (write it as a ratio):

    SLI = <good events expression> / <total events expression>
  2. SLO + window + error budget (fill in concrete numbers):

    • target: ____ %
    • window: rolling ____ days
    • error budget (% of events): ____
    • error budget (absolute events, at current traffic): ____
  3. Consequence for missing the SLO (one sentence):

  4. Fast-burn alert (pseudocode):

    condition:  error_ratio(last ___) > <multiplier> * <budget fraction>
    AND error_ratio(last ___) > <multiplier> * <budget fraction>
    action: PAGE
  5. Slow-burn alert (pseudocode): same structure, looser thresholds, ticket not page.

Evidence Check

This lab is complete only if:

  • library/raw/slo.md exists and contains the SLI, SLO, window, budget, and consequence
  • library/raw/error-budget-policy.md contains the 5-tier ladder
  • at least one burn-rate alert is configured in your monitoring tool
  • you have deleted or demoted to ticket status at least one non-SLO page-level alert
  • you can explain every threshold number in both alerts without checking a book