The Deployment Runbook: A One-Page Plan
What This Concept Is
A deployment runbook is a single page -- actually one printed page -- that tells a stranger how to deploy your capstone, how to tell it worked, how to roll it back, and who to contact when both fail. No more.
The page has five sections:
- Owner and On-call contact (even if it is just you, with a phone number and an escalation path)
- Pre-deploy checks (4-6 bullets; main green, secrets present, migration reviewed, flag state, dependency updates)
- Deploy (the command, link, or trigger -- plus the expected duration and a trivial rollback command you can run if the deploy command fails)
- Verify (the smoke endpoints and what "ok" looks like, plus the dashboard URL to watch)
- Rollback (the command or trigger, the expected duration, and the trigger list from concept 10)
If it does not fit on one page, you have a handbook, not a runbook. The page lives next to the code at RUNBOOK.md, is updated on the same PR as any pipeline change, and is printed and read out loud once per semester.
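The one-page rule and the five-section layout can be checked mechanically. The sketch below is a minimal illustration, not a definitive linter: the function name `check_runbook` and the 60-line budget (`MAX_LINES`) are assumptions standing in for "one printed page", and the section markers are assumed to follow the Markdown layout of the example runbook further down (`**Owner:**` plus `##` headings).

```shell
#!/usr/bin/env bash
# Sketch: check that a runbook has the five sections and fits a line budget.
# MAX_LINES=60 is an assumed proxy for "one printed page"; tune it to your font and printer.
set -euo pipefail

MAX_LINES="${MAX_LINES:-60}"

check_runbook() {
  local file="$1" status=0 lines section
  lines=$(wc -l < "$file")
  if (( lines > MAX_LINES )); then
    echo "too long: ${lines} lines (budget ${MAX_LINES})"
    status=1
  fi
  # Each section must appear as "**Name:" or as an exact "## Name" heading.
  for section in "Owner" "Pre-Deploy" "Deploy" "Verify" "Rollback"; do
    if ! grep -Eq "^(## |\*\*)${section}(:|\$)" "$file"; then
      echo "missing section: ${section}"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_runbook RUNBOOK.md` in CI would return non-zero when the page outgrows the budget or loses a section. A line count is only a rough stand-in for page length, so printing the page remains the real test.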
Why It Matters Here (In the Capstone)
The runbook is the capstone's operational artifact. Every other concept in this module converges here. A reviewer can skim it in 90 seconds and see whether your system is operable or merely deployable. At 3 a.m. -- or in the capstone defense when asked "what would you do if prod broke right now?" -- you will read from the runbook, not from memory.
A runbook is also the only artifact that a stranger (another student, a peer reviewer, you-next-semester) can use to deploy or recover your system without you. That strangers-can-do-it property is the distinction between a personal project and a professional one, and the capstone is graded on the latter.
Concrete Example(s)
RUNBOOK.md at the repo root:
# Capstone Deploy Runbook
**Owner:** @me
**On-call:** @me (+15555551234) -- escalate to reviewer @prof after 30m
**Prod URL:** https://api.capstone.example.com
**Dashboard:** https://console.cloud.google.com/monitoring/dashboards/custom/XYZ
**Last updated:** 2026-05-14 (commit a3b5c7d)
## Pre-Deploy
- [ ] `main` is green on CI
- [ ] CHANGELOG.md stub filled for this release
- [ ] Any new secrets added to Secret Manager
- [ ] Migrations reviewed: additive OR expand-contract pair (see library/raw/migrations/)
- [ ] No feature flags scheduled to flip in this deploy unless noted
- [ ] Cost alerts reset from last deploy
## Deploy
1. `gh workflow run deploy.yml --ref main`
2. Expect ~5 min to green. Watch the Actions run; the smoke step is the last one.
3. On success, finish CHANGELOG.md entry (Why + Risk) within 1 hour.
## Verify
- `curl https://api.capstone.example.com/healthz` -> `200`
- `curl https://api.capstone.example.com/readyz` -> `{"db":"ok"}`
- Login smoke at https://app.capstone.example.com -> /dashboard loads
- Dashboard: 5xx rate < 1% for 10 minutes; p99 < 500ms
## Rollback Triggers
- Smoke step fails in pipeline
- 5xx rate > 5% for 2 minutes
- p99 > 2x baseline for 5 minutes
- Login broken (manual)
## Rollback
1. Find last-known-good tag: `git tag --sort=-creatordate | head -5`
2. `gh workflow run deploy.yml -f image_tag=<prev-tag>`
3. Expect ~4 min to green. Watch smoke step.
4. If pipeline is unavailable, Cloud Run console -> Revisions -> previous -> "Manage Traffic" -> 100%.
5. Record the event in `library/raw/incidents/<date>.md` within 24h.
One page. Printable. Actually useful at 3 a.m.
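The Rollback Triggers list above is precise enough to encode as a decision function. The following is a sketch under stated assumptions: `should_rollback` is a hypothetical helper, and its inputs (the smoke result, a 5xx percentage already sustained for 2 minutes, a p99 already sustained for 5 minutes, and the baseline p99) are assumed to be read off the dashboard by whoever runs it. The manual "login broken" trigger stays a human judgment and is not modeled.

```shell
#!/usr/bin/env bash
# Sketch: evaluate the automatic rollback triggers from the runbook.
# Inputs are assumed to be sustained values read from the dashboard, not raw samples.
set -euo pipefail

# Usage: should_rollback <pass|fail> <5xx_pct> <p99_ms> <baseline_p99_ms>
# Returns 0 (and prints the reason) when a trigger fires, 1 otherwise.
should_rollback() {
  local smoke="$1" err_pct="$2" p99_ms="$3" baseline_ms="$4"
  if [[ "$smoke" != "pass" ]]; then
    echo "rollback: smoke step failed"
    return 0
  fi
  # awk handles the decimal comparison that bash arithmetic cannot.
  if awk -v e="$err_pct" 'BEGIN { exit !(e > 5) }'; then
    echo "rollback: 5xx rate ${err_pct}% > 5%"
    return 0
  fi
  if awk -v p="$p99_ms" -v b="$baseline_ms" 'BEGIN { exit !(p > 2 * b) }'; then
    echo "rollback: p99 ${p99_ms}ms > 2x baseline ${baseline_ms}ms"
    return 0
  fi
  echo "hold: no automatic trigger fired"
  return 1
}
```

For example, `should_rollback pass 0.4 310 300` holds, while `should_rollback pass 6.5 310 300` reports the 5xx trigger. Keeping the thresholds in one function makes it harder for the runbook and the alerting to drift apart.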
A companion scripts/preflight.sh that mechanizes the pre-deploy checks:
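#!/usr/bin/env bash
# Mechanized pre-deploy checks; exits non-zero on the first failure.
set -euo pipefail
# The most recent CI run on main must have concluded successfully.
gh run list --branch main --limit 1 --json conclusion -q '.[0].conclusion' | grep -q success || { echo "main not green"; exit 1; }
# CHANGELOG.md must contain the "## Unreleased" stub for this release.
grep -q "^## Unreleased" CHANGELOG.md || { echo "CHANGELOG stub missing"; exit 1; }
echo "preflight ok"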
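The Verify section can be mechanized in the same spirit. This is a minimal sketch, assuming the `/healthz` and `/readyz` endpoints from the example runbook; the function names (`status_ok`, `ready_ok`, `verify`) and the 10-second timeout are illustrative choices, not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Sketch: smoke-verify a deploy against the runbook's "ok" definitions.
set -euo pipefail

# "ok" for /healthz is an HTTP 200.
status_ok() { [[ "$1" == "200" ]]; }

# "ok" for /readyz is a body containing "db":"ok" (extra keys are tolerated).
ready_ok() { [[ "$1" == *'"db":"ok"'* ]]; }

# Usage: verify https://api.capstone.example.com
verify() {
  local base="$1" code body
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base/healthz") || code="000"
  status_ok "$code" || { echo "healthz returned $code"; return 1; }
  body=$(curl -fsS --max-time 10 "$base/readyz") || { echo "readyz unreachable"; return 1; }
  ready_ok "$body" || { echo "readyz body: $body"; return 1; }
  echo "verify ok"
}
```

Separating the "what ok looks like" predicates from the network calls keeps the definitions easy to compare against the runbook text. The dashboard thresholds and the login smoke still need a human look.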
Common Confusion / Misconceptions
- "Our runbook is in Notion/Confluence." Runbooks that live outside the repo go stale. The runbook belongs next to the code, in the same pull request as the pipeline change that would have broken it. If you must keep it in a wiki, link from RUNBOOK.md in the repo to the canonical version and keep a one-page printable copy in the repo anyway.
- "My runbook is five pages; I want to be thorough." Five-page runbooks are used by nobody. A stranger will skim the first page and improvise the rest. Keep the primary runbook to one page; move long-form detail into separate, linked "playbooks" for specific incidents (library/raw/playbooks/db-failover.md).
- "I update the runbook when things break." That is too late. Update it on the PR that changes the deploy path -- new migration tool, new env var, new rollback mechanism. A runbook that lags the code is worse than no runbook, because people trust it.
- "The runbook is for strangers; I don't need it." You are a stranger to yourself at 3 a.m. six months from now. The runbook is for you then.
- "A runbook is an incident playbook." They are cousins, not the same. The deploy runbook covers normal operations; incident playbooks cover specific failure modes. Link between them; don't merge them.
How To Use It (In Your Capstone)
- Write v1 after your first real prod deploy, not before -- the runbook needs to reflect reality, not hopes.
- Commit it as RUNBOOK.md at the repo root. Update it any time an item on it changes. This is a contract with your future self.
- Print it. Read it out loud. If any line takes more than 10 seconds to parse, rewrite it.
- Have one peer (or ChatGPT, or yourself a week later) read it and try to deploy from it. Fix whatever they get wrong.
- Link every Verify bullet to a dashboard query or smoke command -- vague bullets produce wrong verification.
- Keep it one page. If it grows, extract playbooks; do not let RUNBOOK.md become a handbook.
- Re-read before every prod deploy. It is a 90-second pre-flight and catches things your preflight script does not.
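The "link every Verify bullet" rule can be linted. The sketch below assumes the runbook uses a `## Verify` heading with `- ` bullets, as in the example above; `lint_verify` is a hypothetical helper that treats a bullet as backed if it contains an inline-code span or an http(s) URL.

```shell
#!/usr/bin/env bash
# Sketch: flag Verify bullets that carry neither a command nor a URL.
set -euo pipefail

# Usage: lint_verify RUNBOOK.md
# Returns non-zero and lists offenders when a Verify bullet looks vague.
lint_verify() {
  awk '
    # Track whether we are inside the "## Verify" section.
    /^## / { in_verify = ($0 == "## Verify"); next }
    in_verify && /^- / {
      # A backed bullet has a backtick (inline command) or an http(s) URL.
      if ($0 !~ /`|https?:\/\//) {
        print "vague Verify bullet: " $0
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Under this heuristic, a bullet that states thresholds but relies on a dashboard URL defined elsewhere on the page would be flagged; whether to repeat the URL in the bullet is a judgment call.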
Runbook as Interview Artifact
During the capstone defense, the runbook is one of the first artifacts you will hand to a reviewer. A one-page runbook with pre-deploy checks, the deploy command, a verify script, a trigger list, and a rollback command communicates more about your engineering maturity than a 30-slide architecture deck. It is proof that you can be handed a paging incident at 3 a.m. and not improvise.
In a real team setting, the same runbook is the artifact you leave behind when rotating off on-call. "I updated RUNBOOK.md" is the equivalent of a clean commit message for operational work: it tells the next on-call what changed and why. Start practicing that habit now, at capstone scale, so it is reflexive when the team size grows.
See also (integrative)
- S9 M05 Cluster 5: Runbooks and on-call hygiene -- canonical treatment; capstone is the minimal one-page form
- S9 M05 Cluster 5: Dashboards that answer questions -- the Verify section points at these dashboards
- S9 M04 Cluster 5: Environments, approvals, change management -- the Deploy section encodes the approval/promotion rules
- S8 M04 Cluster 5: Incident lifecycle -- the runbook is the "mitigate" asset
- S8 M05 Cluster 4: Audience-aware explanation -- the "stranger at 3 a.m." audience is the ultimate audience-aware target
- Google SRE Book: Testing for Reliability -- the runbook's verify section is the production-probe surface
- Google SRE Workbook: On-call -- on-call hygiene and escalation, scaled down to solo
- AWS Well-Architected: Operational Excellence -- operational-readiness rubric
- PagerDuty: Incident Response Runbooks -- alert-to-runbook pairing conventions
Check Yourself
- Does your runbook fit on one printed page?
- If your laptop is gone, can a stranger deploy from the runbook alone?
- When was the runbook last updated, and what changed in the deploy path since?
- Is every Verify bullet backed by a command or URL, not a vague claim?
- What is the escalation path if you (the owner) are unavailable for 30 minutes?
- Which line in the runbook is most likely to be wrong right now, and why haven't you fixed it?
Mini Drill or Application (Capstone-scoped)
- Write or revise RUNBOOK.md today. Print it. Hand it to a peer (or read it to yourself out loud). Ask "how would you deploy this?" and note every line they pause on. Fix those first.
- Peer-deploy test. Ask a peer to run through the runbook literally -- they execute the commands, you only answer "yes/no." Every question they ask that is not in the runbook is a gap you fix.
- Runbook-PR discipline. On the next PR that touches deploy.yml, terraform/, or scripts/smoke.sh, update RUNBOOK.md in the same PR. If the PR is merged without a runbook touch, open a fast-follow to add it -- but prefer doing it once.
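Runbook-PR discipline can be enforced with a small CI guard. This is a sketch, not the course's prescribed check: `runbook_touched_ok` is a hypothetical function that reads changed file paths on stdin, and the workflow path `.github/workflows/deploy.yml` is an assumption about where deploy.yml lives.

```shell
#!/usr/bin/env bash
# Sketch: fail when deploy-path files change without a RUNBOOK.md update.
# Reads one changed path per line on stdin.
set -euo pipefail

runbook_touched_ok() {
  local deploy_changed=0 runbook_changed=0 path
  while IFS= read -r path; do
    case "$path" in
      RUNBOOK.md) runbook_changed=1 ;;
      # Assumed deploy-path files; adjust to your repo layout.
      .github/workflows/deploy.yml|terraform/*|scripts/smoke.sh) deploy_changed=1 ;;
    esac
  done
  if (( deploy_changed && ! runbook_changed )); then
    echo "deploy path changed without a RUNBOOK.md update"
    return 1
  fi
  return 0
}
```

In a pull-request job it could be fed with `git diff --name-only origin/main...HEAD | runbook_touched_ok`. The guard only proves the file was touched, not that the edit is correct, so the peer-deploy test still matters.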
Source Backbone
Capstone deployment applies cloud, delivery, and operations material. These books are the source backbone for the delivery decisions.
- Building Secure and Reliable Systems - secure/reliable deployment posture.
- GitHub Actions in Action - workflow automation support.
- Pro Git - release history, tags, and branch discipline.
- The Linux Command Line - shell and deployment automation support.