Writing a Runbook for the Top 3 Incidents

What This Concept Is

A runbook is a one-page, incident-specific procedure the on-call operator follows when a known failure fires. It is not documentation of the system. It is documentation of what to do, in order, right now.

A capstone runbook has five sections, each short:

  1. Symptoms -- what you see that tells you this is this incident (alert name, dashboard signal, user report phrasing).
  2. Impact -- what is broken from the user's point of view and how it maps to the SLO.
  3. Checks -- three to six ordered diagnostic steps that either confirm the incident or eliminate it.
  4. Mitigations -- the specific, ordered actions that reduce impact, with expected effect and rollback.
  5. Escalation -- who to call, what to hand off, what to write in the incident channel.

One page. Not two. If it is longer, you will not read it under pressure, which defeats its purpose.
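
The five-section, one-page rule can be checked mechanically before a runbook is committed. A minimal sketch, assuming the runbook is markdown with `## ` headings as in the template later in this section; the function name `lint_runbook` and the 60-line page budget are illustrative assumptions, not part of any capstone tooling.

```python
REQUIRED_SECTIONS = ["Symptoms", "Impact", "Checks", "Mitigations", "Escalation"]
MAX_LINES = 60  # assumed stand-in for "one page"; tune to your renderer


def lint_runbook(text: str, max_lines: int = MAX_LINES) -> list[str]:
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    lines = text.splitlines()

    # Collect "## Heading" titles, keeping only the first word so that
    # "## Checks (in order)" still matches "Checks".
    headings = [
        ln[3:].split()[0]
        for ln in lines
        if ln.startswith("## ") and ln[3:].strip()
    ]

    for name in REQUIRED_SECTIONS:
        if name not in headings:
            problems.append(f"missing section: {name}")

    # Required sections that are present must appear in the prescribed order.
    present = [h for h in headings if h in REQUIRED_SECTIONS]
    expected = [s for s in REQUIRED_SECTIONS if s in present]
    if present != expected:
        problems.append("sections out of order")

    if len(lines) > max_lines:
        problems.append(f"too long: {len(lines)} lines > {max_lines}")
    return problems
```

Run it on each runbook before committing. A line count is only a rough proxy for "one page", but it is enough to catch a runbook that has quietly grown into architecture documentation.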

Runbooks live next to, but are distinct from, postmortem templates and architecture docs. A runbook answers "what do I do in the next 15 minutes?" A postmortem template answers "what did we learn in the last 48 hours?" Architecture docs answer "how does this system work?" Conflating them gives you a 12-page document nobody reads at 2 a.m.

Why It Matters Here (In the Capstone)

At 2 a.m., short-term memory is worse, judgment is worse, and the blast radius of a wrong command is the same as ever. The runbook is a trust anchor against your own impaired reasoning. A runbook that has been read twice in daylight is worth more than ten smart but unpracticed thoughts at 2 a.m.

You are writing exactly three. The inputs are:

  • Cluster 4 concept 10 (top three failures)
  • Cluster 1 concept 3 (the alerts that will fire)
  • Cluster 2 concept 6 (the trace you will inspect)
  • Cluster 4 concept 11 (the mitigations you have already built)
  • Cluster 4 concept 12 (the restore drill you have already done)

If any of those is missing, the runbook has nothing real to point at. That is not an accident -- this module is a pipeline, and the runbook is its last, most operator-facing artifact.

Concrete Example -- Runbook Template

Use this exact layout. Save each runbook at library/raw/runbooks/<incident-slug>.md.

# Runbook: Notification API Outage / Slowness

Last reviewed: 2026-04-22 • Owner: you

## Symptoms
- Alert fires: `webhook-slo-fast-burn` or `webhook-latency-fast-burn`
- Dashboard row "Slow?" shows p95 > 800ms sustained
- Top-errors table shows `reason=downstream_timeout` climbing

## Impact
- Webhooks are ACKed late or with 5xx on the critical path
- Affects availability SLI (target 99.5% / 30d); 5xx count contributes to burn
- External providers may mark us as down and trigger retries

## Checks (in order)
1. Open the latest SLO-burn trace in the tracing UI. Confirm the slow span
is `http.post downstream api`. If not, stop -- this is a different incident.
2. Check `notify.svc` provider status page: {{link}}
3. Check circuit-breaker metric `notify.breaker.state`. If "open", the breaker
mitigation is already active; continue to check 4.
4. Check queue depth `events.incoming`. If climbing > 1000 events, we are
accumulating backlog.

## Mitigations (in order; stop when impact subsides)
1. **If breaker is closed** and errors < 20%: no action; keep watching.
The fast-burn may clear within the 1h window.
2. **If breaker is closed** and errors ≥ 20%: manually open the breaker:
`POST /ops/breaker/notify/open`. Expected effect: webhook ACK latency
returns to <100ms within 30s; `notify_pending` events accumulate.
Rollback: `POST /ops/breaker/notify/close`.
3. **If queue backlog is climbing > 5000**: scale the consumer fleet
from 2 to 6 with `terraform apply -target=module.worker -var workers=6`.
Expected effect: backlog drains at ~3x normal rate. Rollback: rerun with
`-var workers=2` within 2h to avoid cost drift.
4. **If downstream outage exceeds 30 min**: switch to the documented
degraded mode. Requires `FEATURE_NOTIFY_DEGRADED=on` env var and a redeploy.
Expected effect: notifications are deferred to the email digest.
Rollback: revert the env var and redeploy.

## Escalation
- At 15 minutes without mitigation reducing symptoms: post in the
`#capstone-ops` channel with trace ID, top errors, and what you tried.
- At 30 minutes: declare incident in the incident doc template
(`/templates/incident.md`), page the reviewer (Dr. X), and stop
guessing -- begin coordinated response.

## Post-incident
- File postmortem at `library/raw/postmortems/YYYY-MM-DD-notify.md`
- Update the "Issues found" section of this runbook (add it at the bottom if it does not exist yet) if any step was wrong.
- Check error budget consumption and apply policy from `library/raw/error-budget-policy.md`.

Notice: five core sections plus a short post-incident checklist, roughly one page. Every mitigation that changes the system states an expected effect and a rollback. That discipline is what makes the runbook usable under stress.
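
Written as code, the Mitigations ladder in the template is a small decision function. A minimal sketch using the template's thresholds (20% errors, 5000 queued events, 30 minutes of downstream outage); the `Signals` type and the `next_mitigations` function are hypothetical names for illustration, and the operator still runs every command by hand.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    breaker_open: bool   # from notify.breaker.state
    error_rate: float    # fraction of requests failing, 0.0 to 1.0
    queue_depth: int     # events.incoming backlog
    outage_minutes: int  # how long the downstream has been failing


def next_mitigations(s: Signals) -> list[str]:
    """Return the applicable mitigation steps, in runbook order."""
    steps = []
    if not s.breaker_open:
        # Exactly 20% is treated as the act-now case (an assumption of this sketch).
        if s.error_rate < 0.20:
            steps.append("1: no action; keep watching")
        else:
            steps.append("2: POST /ops/breaker/notify/open")
    if s.queue_depth > 5000:
        steps.append("3: scale workers from 2 to 6")
    if s.outage_minutes > 30:
        steps.append("4: enable degraded mode and redeploy")
    return steps
```

If a mitigation ladder cannot be written as a function this short, its conditions are probably too ambiguous to follow at 2 a.m. The function is a clarity test for the text, not a replacement for it.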

Runbook drill (rehearsal). Once per runbook, simulate the page. Set a 15-minute timer. Start with only the alert name, open the runbook, and walk the checks as if the incident is real. Where you stumble, tighten the text. PagerDuty and incident.io write about this as "tabletop exercises"; at capstone scale, a single operator with a timer is enough.

Common Confusion / Misconceptions

"A runbook explains the system." No -- that is architecture documentation. A runbook only says "you are on call; this alert fired; here is what to do." It assumes the operator knows the system well enough to recognize when the runbook is wrong.

"More steps = more safety." More steps = slower response and higher chance of skipping the right step. A runbook is a triage tool. Aim for ≤ 6 checks and ≤ 6 mitigations.

"Runbooks go stale, so why write them?" They go stale because they are treated as write-once. Treat them as living documents: every time one is used, the last operator updates "Issues found." The runbook improves with each incident.

"Runbooks are for ops; I'm a developer." At capstone scale, you are both. The runbook is a gift to your future self at 2 a.m., not to a separate team.

"We can just use logs and figure it out." Fine the second time. At 2 a.m. the first time, you will page someone, or guess, or do something expensive. The cost of the runbook is 30 minutes; the cost of not having it is unbounded.

How To Use It (In Your Capstone)

  1. Take the three incidents from Cluster 4 concept 10.
  2. For each, write one runbook using the template above. One page, five sections.
  3. Have a trusted peer read the runbook (not the system) and answer: "Could I follow this alone?" If not, tighten.
  4. Rehearse each runbook once with a 15-minute timer, starting only from the alert name.
  5. Every time an incident happens -- even a small staging one -- update the relevant runbook's "Issues found."
  6. Link all three runbooks from the PRR checklist and from the alert that fires the page.
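
Step 6 above can be verified with a short script. A minimal sketch, assuming each alert definition is available as a dictionary with a `runbook` field holding a repository-relative path; that field name and the `unlinked_alerts` helper are assumptions for illustration, so adapt them to however your alerting tool stores annotations.

```python
from pathlib import Path


def unlinked_alerts(alerts: list[dict], repo_root: Path) -> list[str]:
    """Return names of alerts whose runbook link is missing or points at no file."""
    broken = []
    for alert in alerts:
        link = alert.get("runbook")  # hypothetical field name
        if not link or not (repo_root / link).is_file():
            broken.append(alert["name"])
    return broken
```

An empty result means every paging alert leads to a runbook file that actually exists in the repository.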

Check Yourself

  1. Why does each mitigation step need a stated expected effect and a rollback?
  2. What is the argument for keeping a runbook to one page even if more information is available?
  3. What is the single sentence that every Symptoms section should include?
  4. How does rehearsal with a timer reveal weaknesses that reading the runbook does not?
  5. Why does the runbook's "Escalation" section have time-based triggers rather than condition-based ones?

Mini Drill or Application (Capstone-scoped)

  1. Pick your top failure from concept 10. Write its runbook using the template above. Keep it to one page.
  2. Read your own runbook aloud as though you had been paged. If any mitigation has no rollback, add one.
  3. Hand the runbook (and only the runbook) to a peer. Ask them to walk through the checks against the dashboard and trace. Note where they get stuck.
  4. Rehearse the runbook once with a 15-minute timer. Record where you stumbled and tighten the text.
  5. Commit to library/raw/runbooks/<slug>.md and link from the alert definition.
  6. Repeat for failures 2 and 3.
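
The rollback check in step 2 of this drill can also be automated. A minimal sketch, assuming mitigations are numbered items under a `## Mitigations` heading and use the literal labels `Expected effect:` and `Rollback:` as in the template; `mitigations_missing_fields` is a hypothetical helper, and steps containing "no action" are exempt because there is nothing to undo.

```python
import re


def mitigations_missing_fields(text: str) -> dict[int, list[str]]:
    """Map mitigation step number -> labels it lacks ('Expected effect', 'Rollback')."""
    in_section = False
    steps: dict[int, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            in_section = line[3:].startswith("Mitigations")
            current = None
            continue
        if not in_section:
            continue
        m = re.match(r"^(\d+)\.\s", line)
        if m:
            current = int(m.group(1))
            steps[current] = line
        elif current is not None:
            # Unnumbered lines continue the current step.
            steps[current] += " " + line

    missing = {}
    for num, body in steps.items():
        if "no action" in body.lower():
            continue  # a watch-only step changes nothing, so nothing to roll back
        lacks = [lbl for lbl in ("Expected effect", "Rollback") if lbl + ":" not in body]
        if lacks:
            missing[num] = lacks
    return missing
```

An empty dictionary means every state-changing mitigation names both its expected effect and its rollback. The script only checks that the labels are present; whether the rollback actually works is what the timed rehearsal is for.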

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.