Runbooks and On-Call Hygiene

What This Concept Is

A runbook is a short, operational document that says: when this alert fires, do this. It is written for a specific alert (or a specific failure mode) and it is meant to be read at 3 a.m. by someone who is tired, possibly new to the service, and does not have time to read code.

A usable runbook has five parts, and usually fits on one screen:

  1. Trigger -- the alert name or the failure symptom that brings you here.
  2. Immediate verification -- 1-3 commands or dashboard links that confirm the alert is real, not flapping.
  3. Impact -- what users see right now, and roughly how many.
  4. Diagnostic steps -- the short decision tree: "if A, page team X; if B, run command Y; if the result is red, escalate".
  5. Mitigations and rollback -- the set of things that have worked before, ordered by reversibility. "Roll back the last deploy" before "restart the database".
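
The five-part structure is simple enough to check mechanically. The following is a minimal sketch, assuming runbooks are Markdown files whose parts appear as `## ` headings named as in the list above. The function names, the heading-matching rule, and the 60-line "one screen" budget are illustrative assumptions, not an established convention.

```python
import re

# Assumed section keywords, one per part of the five-part template.
REQUIRED_SECTIONS = ("trigger", "immediate verification", "impact",
                     "diagnostic steps", "mitigations")

# Assumption: "one screen" is roughly 60 lines; tune this for your team.
MAX_LINES = 60


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that have no matching '## ' heading."""
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^##[ \t]+(.+)$", runbook_text, flags=re.MULTILINE)
    ]
    return [
        section for section in REQUIRED_SECTIONS
        if not any(h.startswith(section) for h in headings)
    ]


def fits_one_screen(runbook_text: str, max_lines: int = MAX_LINES) -> bool:
    """True when the runbook is within the assumed line budget."""
    return len(runbook_text.splitlines()) <= max_lines


sample = """# Runbook: example
## Trigger
## Immediate verification (60 seconds)
## Impact
## Mitigations (try in order)
"""
print(missing_sections(sample))  # ['diagnostic steps']
print(fits_one_screen(sample))   # True
```

A check like this only proves the headings exist. Whether the content under them is usable at 3 a.m. still needs a human reviewer.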

On-call hygiene is the meta-discipline that keeps the runbook system healthy:

  • every symptom alert has a runbook (enforced at alert-creation time)
  • every incident ends with a post-incident review that updates the runbook
  • runbooks are treated like code -- reviewed, versioned, and discoverable
  • rotations are scheduled with compassion (sleep, handoffs, shift limits)
  • alert noise is treated as a bug, not a cost of doing business
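
The first rule, a runbook for every symptom alert, can be enforced at alert-creation time with a small check. This is a sketch under stated assumptions: the `Alert` class, the `pages` flag, and the `runbook` annotation key are hypothetical stand-ins for whatever your alerting system actually stores, and every alert name and path except `checkout_success_rate_below_slo` is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical in-memory form of an alert definition."""
    name: str
    pages: bool  # does this alert page a human?
    annotations: dict[str, str] = field(default_factory=dict)


def alerts_without_runbook(alerts: list[Alert]) -> list[str]:
    """Names of paging alerts that lack a non-empty 'runbook' annotation."""
    return [
        a.name for a in alerts
        if a.pages and not a.annotations.get("runbook", "").strip()
    ]


alerts = [
    Alert("checkout_success_rate_below_slo", pages=True,
          annotations={"runbook": "runbooks/checkout_success_rate.md"}),
    Alert("disk_usage_info", pages=False),        # non-paging: exempt
    Alert("payments_latency_high", pages=True),   # paging, no runbook
]

offenders = alerts_without_runbook(alerts)
print(offenders)  # ['payments_latency_high']
```

Run in CI, a non-empty result would fail the pull request that adds the alert.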

Why It Matters Here

At 3 a.m., working memory is gone and context-switching cost is high. A good runbook is the difference between a 5-minute mitigation and a 90-minute firefight with a senior engineer paged in as rescue. The SRE literature is consistent on this point: alerting without runbooks is a tax, not a safety net.

Runbooks are also how institutional knowledge becomes portable. An engineer who leaves the team takes their instinct with them; a runbook takes their instinct into a file that the next on-call can read.

Security incidents benefit the same way. A cloud workload protection platform (CWPP) alert that pages without a runbook ("container spawned unexpected shell in namespace X") leaves the on-call guessing. The same alert with a runbook ("verify the image digest against the signature, isolate the pod with network policy Y, take a forensic snapshot using script Z, page the security on-call") is actionable.

Concrete Example

A runbook for the alert checkout_success_rate_below_slo (from Concept 14).

# Runbook: checkout success rate below SLO

## Trigger
Alert `checkout_success_rate_below_slo` fires when the 10-minute 2xx rate for
`POST /checkout` drops below 99.5%.

## Immediate verification (60 seconds)
1. Open the dashboard: https://grafana/d/checkout-health
2. Confirm the success-rate panel is below the SLO line
3. Check the "active incidents and recent deploys" panel for a correlated deploy

## Impact
Users cannot complete checkout. At current traffic (~2k rpm), every 1% of
error rate is ~20 failed checkouts per minute.

## Diagnostic steps
1. Error class:
   - mostly 5xx -> we are the source, continue to step 2
   - mostly 4xx -> check for a client/version issue; page @frontend-on-call
2. Is this correlated with the most recent deploy (annotation in the dashboard)?
   - yes -> MITIGATION A
   - no -> continue to step 3
3. Check upstream dependency panels (payments, inventory, orders):
   - one is red -> page the upstream team, set incident severity, MITIGATION C
   - all green -> MITIGATION B

## Mitigations (try in order, most reversible first)
- MITIGATION A: roll back the last deploy
  `./scripts/deploy.sh rollback checkout`
  Verify success rate recovers within 3 minutes.
- MITIGATION B: scale out by about 20%
  `kubectl get deploy/checkout -o jsonpath='{.spec.replicas}'`
  `kubectl scale deploy/checkout --replicas=<current count x 1.2, rounded up>`
  `--replicas` takes an absolute count, not a percentage.
  Only if the saturation panel shows in-flight request saturation.
- MITIGATION C: enable the failover path
  `./scripts/feature-flag.sh enable checkout.use_backup_payments`
  Revert when upstream recovers.

## Escalation
- Severity 2 if success rate below 95% for >15 min
- Severity 1 if below 80% at any time
- Security concern (data exposure, auth bypass) -> page @security-oncall

## After the incident
- File an incident ticket; link this runbook
- Update this runbook with anything new you learned

That is a runbook a tired operator can execute.
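
The arithmetic in the Impact and Escalation sections of the runbook above can be sketched as follows. This is an illustration of the thresholds as written, not part of the runbook. The function names are hypothetical, and rates are expressed as fractions (0.95 means 95%).

```python
from typing import Optional


def failed_per_minute(requests_per_minute: float, error_rate: float) -> float:
    """Estimated failed checkouts per minute; error_rate is a fraction."""
    return requests_per_minute * error_rate


def severity(success_rate: float, minutes_below_95: float) -> Optional[int]:
    """Map the runbook's escalation rules to a severity level.

    Returns 1, 2, or None when neither escalation rule applies.
    """
    if success_rate < 0.80:
        return 1  # below 80% at any time
    if success_rate < 0.95 and minutes_below_95 > 15:
        return 2  # below 95% for more than 15 minutes
    return None


# At ~2k rpm, a 1% error rate is about 20 failed checkouts per minute.
print(round(failed_per_minute(2000, 0.01)))  # 20
print(severity(0.78, minutes_below_95=0))    # 1
print(severity(0.93, minutes_below_95=20))   # 2
print(severity(0.93, minutes_below_95=5))    # None
```

Checking the Severity 1 rule first matters: a success rate below 80% is also below 95%, and the more severe classification should win.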

Common Confusion / Misconception

"The runbook is the code." It is not. Runbooks are for humans. They link to automation, but the human is the decision-maker. Fully automated remediation is a separate system (with its own failure modes and its own runbooks for when automation itself misbehaves).

"Runbooks are architecture docs." An architecture diagram belongs somewhere, but not in the runbook. The runbook is about the operational moves for this alert, nothing more. Over-documented runbooks do not get read during incidents.

"Runbooks are evergreen." They age, fast. An unmaintained runbook is worse than none, because it sends operators down broken paths. Every incident must update the runbook, or mark steps known-bad. A last_reviewed header with a stale-after date is a cheap forcing function.

"On-call hygiene is a personal discipline." It is a team responsibility. One heroic engineer cannot carry a rotation alone; when that engineer burns out, the service does. PagerDuty's incident response guide, the Google SRE book's chapters on managing incidents and emergency response, and the SRE workbook all make this collective framing explicit.

"All alerts that page need a runbook." Yes, but the runbook can be short. Three lines naming a diagnostic dashboard, an escalation contact, and a tested mitigation script is enough for many alerts. "No runbook" ships no alert to production.

"Security runbooks are different." The structure is the same; the content emphasizes evidence preservation. A CWPP runbook for "shell in container" should front-load: isolate the pod (not kill it -- you want the process tree), snapshot the node disk, capture network flows, then remediate. Destroying evidence to speed mitigation is a common anti-pattern.

"Post-incident reviews are blame sessions." The SRE literature is consistent: blameless post-mortems produce better systems. The artifact that leaves a post-mortem is an updated runbook, an updated alert, or an updated code path -- not a list of humans who failed.

How To Use It

For every symptom alert:

  1. Write the runbook at the same time as the alert. No runbook, no alert in production.
  2. Keep it on one screen. If it does not fit, split it.
  3. Link every command to a tested script; prefer reversibility in the ordering.
  4. Put the runbook in the same repo as the alert definition, reviewed by the same PRs.
  5. In every post-incident review, list the runbook changes made. If none were made, say why.

Check Yourself

  1. Why is "one screen" a hard constraint on a runbook?
  2. Why are mitigations ordered by reversibility instead of speed?
  3. What is the difference between a runbook and automated remediation, and when do you want each?

Mini Drill or Application

Pick a real (or plausible) failure mode from a service you have worked on. Write a runbook using the 5-part template above. Have a peer read it in under two minutes and tell you, in one sentence, what they would do first. If they cannot, rewrite.

Source Backbone

Security and observability require official docs, but these books provide the systems and reliability backbone behind the practices.