Writing One Real SLI and SLO for Your Capstone

What This Concept Is

A Service Level Indicator (SLI) is a measurement: a number derived from real events that tells you whether your users are getting what they expect. A Service Level Objective (SLO) is a target on that SLI over a window: "SLI should be at least X over the last N days." Those two sentences contain the entire SRE reliability conversation -- every dashboard, alert, gate, and postmortem in this module traces back to them.

Every SLO rests on three choices you cannot avoid:

which events you count (good and total)
which window you evaluate over (usually 7, 28, or 30 days for a capstone)
which consequence triggers when the target is missed

An SLI without a target is a metric. A target without a window is a slogan. A window without a consequence is theater. You need all three, and the SRE Workbook's "SLO document" template exists precisely to force you to write them down in one place instead of scattering them across dashboards and chat logs.

SLIs come in four shapes, and picking the right shape is almost always more important than picking the right number. Availability (success/total) measures whether the call worked. Latency (fast enough / total) measures whether it worked in time. Freshness (recent enough / total) measures whether the data the user saw was current. Correctness (correct / total) measures whether the answer was right. A capstone usually needs availability plus one of the other three, and no more.

For a capstone, one SLO done well beats five SLOs done aspirationally. Pick the user-facing thing that, if it breaks, would make a user close the tab -- and measure exactly that. Everything else in Module 4 assumes this single decision has been made with integrity.

Why It Matters Here (In the Capstone)

The rest of this module assumes you have an SLI and SLO. The error budget in concept 2 is a subtraction from the SLO: 1 - target. The alerts in concept 3 are derived from the burn rate of that budget. The dashboard in cluster 2 shows the SLI. The threat model's DoS row (concept 7) is framed as "could this breach the SLO?" The runbooks in cluster 5 quote the SLO to define "impact." The PRR (concept 15) has a row that says "SLI defined" and a row that says "SLO target defensible given traffic."

If you skip this concept, everything after it is unanchored. If you define ten SLOs you do not measure, you have the same problem one layer deeper -- plausible on paper, un-enforceable in operation.

Concrete Example(s) -- from a real capstone

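Capstone: a small public API that accepts incoming webhooks and writes them to a queue. Baseline traffic is approximately 200,000 requests over a rolling 30-day window. That averages under 0.1 req/s (200,000 requests / 2,592,000 seconds), so the 4-8 req/s figure describes short daily peaks, not sustained load.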

SLI (availability, request-based):

availability = count(http_requests where status_code < 500)
             / count(http_requests)
             measured over a rolling 30-day window, scoped to the /webhook endpoint

SLI (latency, request-based):

latency_ok   = count(http_requests where status_code < 500 AND duration_ms <= 300)
             / count(http_requests where status_code < 500)
             measured over a rolling 30-day window, scoped to the /webhook endpoint
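
The two ratios above can be sketched in Python as follows. This is a minimal illustration, not the capstone's actual instrumentation: the `Request` record and its field names are assumptions made for the example, and a real service would more likely compute these ratios in its metrics or log query layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One request record. Field names are illustrative, not a real log schema."""
    path: str
    status_code: int
    duration_ms: float


def availability_sli(requests, path="/webhook"):
    """Fraction of requests to `path` that did not return a 5xx."""
    scoped = [r for r in requests if r.path == path]
    if not scoped:
        return None  # no traffic: the ratio is undefined, not 100%
    good = sum(1 for r in scoped if r.status_code < 500)
    return good / len(scoped)


def latency_sli(requests, path="/webhook", threshold_ms=300):
    """Among non-5xx requests to `path`, fraction served within threshold_ms."""
    served = [r for r in requests if r.path == path and r.status_code < 500]
    if not served:
        return None
    fast = sum(1 for r in served if r.duration_ms <= threshold_ms)
    return fast / len(served)


# Made-up sample data, only to exercise the two functions.
sample = [
    Request("/webhook", 200, 120.0),
    Request("/webhook", 200, 450.0),  # served, but slower than 300 ms
    Request("/webhook", 503, 30.0),   # failed: excluded from the latency denominator
    Request("/webhook", 202, 80.0),
    Request("/health", 500, 5.0),     # out of scope for this SLI
]
print(availability_sli(sample))
print(latency_sli(sample))
```

Note the design choice mirrored from the formulas: failed requests count against availability but are left out of the latency denominator, so one outage is not charged to both SLIs.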

SLO:

availability SLI >= 99.5% over any rolling 30-day window
latency SLI >= 99.0% over any rolling 30-day window

Error budget arithmetic -- do the numbers, always:

Target	Allowed bad requests / 30d (at 200k requests)	Allowed downtime / 30d
99.0%	2,000	7h 12m
99.5%	1,000	3h 36m
99.9%	200	43m
99.95%	100	22m
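
The table above can be reproduced for any traffic level with a few lines of Python. This is a sketch under the same assumptions as the table (a 30-day window and 200,000 requests); the function names are chosen for this example only.

```python
def error_budget(target, total_requests, window_days=30):
    """Return (allowed_bad_requests, allowed_downtime_minutes) for one target."""
    budget_fraction = 1.0 - target
    allowed_bad = round(budget_fraction * total_requests)
    allowed_minutes = budget_fraction * window_days * 24 * 60
    return allowed_bad, allowed_minutes


def format_minutes(minutes):
    """Render minutes as '7h 12m', or as '43m' when under one hour."""
    hours, mins = divmod(round(minutes), 60)
    return f"{hours}h {mins:02d}m" if hours else f"{mins}m"


for target in (0.99, 0.995, 0.999, 0.9995):
    bad, minutes = error_budget(target, total_requests=200_000)
    print(f"{target:.2%}\t{bad:,}\t{format_minutes(minutes)}")
```

Swapping `total_requests` for your own Low / Normal / Peak estimates gives the same table for your capstone's traffic.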

At 99.5% you have 1,000 allowed failed requests per month, or about 216 minutes of total allowed downtime. A single bad deploy that emits 5xx for 5 minutes at peak (approximately 3,000 requests attempted) could consume 300% of the monthly budget in one incident. At 99.9% the budget shrinks to 200 requests, and the same incident would consume 1,500% of it. That arithmetic is the reason capstones cannot defend 99.9% -- not morality, not ambition, just the math of low traffic meeting a tight target.
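
To check how much of the budget a single incident uses, the same arithmetic fits in one function. A minimal sketch using the numbers from the paragraph above (about 3,000 failed requests against a 200,000-request window); `budget_consumed` is a name invented for this example.

```python
def budget_consumed(failed_requests, target, window_requests):
    """Fraction of the window's error budget used by one incident (1.0 == 100%)."""
    allowed_bad = (1.0 - target) * window_requests
    return failed_requests / allowed_bad


# Roughly 3,000 failed requests in one incident, 200k requests per 30-day window.
for target in (0.995, 0.999):
    used = budget_consumed(3_000, target, 200_000)
    print(f"target {target:.1%}: incident uses {used:.0%} of the 30-day budget")
```

Any result above 1.0 means the window is already red, whatever happens for the rest of the month.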

Consequence if the SLO is missed (also written into library/raw/slo.md):

No new risky deploys for seven days after the window turns red.
One postmortem filed in library/raw/postmortems/ naming the top contributor to the burn.
The next sprint's first ticket is reliability work for the biggest cause.

Notice what made this tractable: one endpoint, one indicator, one target, a named window, and a written consequence. A junior engineer could read this and know what to measure and what to do.

Common Confusion / Misconceptions

"We have 100% as our goal." 100% is not an SLO; it is a denial of reality. Even the underlying cloud provider does not promise 100%. An SLO below 100% is what allows you to deploy, experiment, and take downtime for maintenance without immediately breaking a promise.

"Our SLI is CPU utilization." CPU is a system metric, not a service level indicator. SLIs measure what users experience: success, latency, correctness, freshness. A service can have 20% CPU and be completely broken. If your SLI would not change when the user experience gets worse, it is the wrong SLI.

"99.9% sounds more impressive, let's use that." The difference between 99.5% and 99.9% over 30 days is the difference between 216 minutes of allowed downtime and 43 minutes. Pick the number you can actually meet with the infrastructure you actually have. Aspirational SLOs get ignored within a month.

"We'll back into the SLO from what our graphs already show." That is a status number, not an objective. The SLO is a promise about the future; the status is the past. Writing the SLO from last month's graph locks in whatever reliability you happened to have, including the accidents.

How To Use It (In Your Capstone)

Before writing any code for Module 4:

Name the one user journey that defines "working" for your capstone. One.
Pick an SLI type: availability (success/total), latency (fast enough / total), freshness (recent enough / total), or correctness (correct / total).
Write the SLI as a ratio of events you can actually count from logs, metrics, or traces that already exist.
Pick a target you could defend to a senior engineer with your current architecture. Compute the error-budget table above and stare at the "allowed bad" column before committing.
Pick a window (30 days is a good default for capstones; 7 days over-reacts, 90 days hides slow regressions).
Write the consequence in one sentence: what you will stop doing and what you will do instead.
Commit the page as library/raw/slo.md, link it from the PRR checklist, and set a calendar reminder to re-read it in 30 days.

Check Yourself

Why is an SLI on CPU utilization almost always wrong even when CPU clearly correlates with user pain?
What is the numeric error budget implied by an SLO of 99.9% over 28 days, in minutes of allowed downtime?
Why is a consequence -- written down -- part of the SLO, not an optional extra?
Given 10 requests per second and a 99.5% availability SLO, roughly how many 5xx responses would exhaust the whole 30-day budget, and could a 5-minute bad deploy emit that many?
What separates an SLO from a retroactive status report of your current reliability?

Mini Drill or Application (Capstone-scoped)

Write the one SLO you are willing to be judged on. Use this exact order in library/raw/slo.md:

SLI (formula):   good_events / total_events = ...
Window:          rolling 30 days
Target:          99.5%
Error budget:    0.5% of total_events over the window (N events)
Consequence:     if missed, <one sentence: what stops, what starts>
Owner:           you
Last reviewed:   YYYY-MM-DD

Compute the error-budget table for your actual traffic (Low / Normal / Peak estimates). If 99.5% is survivable but 99.9% is not, you have your defensible target.
Instrument the SLI: write the exact query (PromQL, LogQL, or whatever your stack uses) that produces good / total, and commit it next to the SLO doc.
Have one peer read the SLO doc and attempt to describe the consequence in their own words. If they cannot, rewrite the consequence.
Schedule the 30-day review in your calendar. No calendar entry, no SLO.
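
One way to keep the SLO doc and its numbers in sync is to generate the field block instead of typing it. The sketch below is an assumption-laden illustration: `render_slo_doc` and its parameters are invented here, and the sample values (formula text, consequence wording, review date) are placeholders to be replaced with your own.

```python
from datetime import date


def render_slo_doc(sli_formula, target, window_days, total_events,
                   consequence, owner, reviewed=None):
    """Render the SLO fields in the order the drill asks for."""
    reviewed = reviewed or date.today()
    budget_fraction = 1.0 - target
    budget_events = round(budget_fraction * total_events)
    lines = [
        f"SLI (formula):   good_events / total_events = {sli_formula}",
        f"Window:          rolling {window_days} days",
        f"Target:          {target * 100:g}%",
        f"Error budget:    {budget_fraction * 100:g}% of total_events "
        f"over the window ({budget_events:,} events)",
        f"Consequence:     if missed, {consequence}",
        f"Owner:           {owner}",
        f"Last reviewed:   {reviewed.isoformat()}",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
doc = render_slo_doc(
    sli_formula="count(status_code < 500) / count(all), /webhook only",
    target=0.995,
    window_days=30,
    total_events=200_000,
    consequence="pause risky deploys for seven days and file one postmortem",
    owner="you",
    reviewed=date(2024, 1, 1),
)
print(doc)
```

The returned string can be written to library/raw/slo.md; computing N from the target and traffic means the "Error budget" line cannot drift from the "Target" line.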

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.

Building Secure and Reliable Systems - primary security and reliability backbone.
Software Engineering at Google - operational process and engineering discipline.
Designing Distributed Systems - service and reliability pattern support.
