Production Readiness Review -- a Capstone Checklist
What This Concept Is
A Production Readiness Review (PRR) is a structured gate: before a service is treated as "in production," a checklist is walked item by item, and each item must be green (done), yellow (accepted risk, documented), or red (blocker). At Google's original scale the PRR is run by an SRE team on a launching service; at capstone scale it is run by you, on yourself, honestly.
This concept contains the actual checklist you will walk. It distills every preceding concept in this module into a single sheet. The PRR is the artifact a senior engineer or examiner looks at first -- if it is defensible, the rest of the defense becomes tractable.
The PRR is not a rubber stamp. Any "yellow" requires a one-line written justification and a date by which the risk will be retired; any "red" either blocks launch or becomes an accepted launch-risk with a dated follow-up. Rigour is what distinguishes a PRR from a self-congratulation exercise.
PRRs also have a reviewing character: in industry they are walked by someone other than the author. In a capstone you play both roles -- author and reviewer -- which means you must read the checklist with the skepticism a senior engineer would bring, not the optimism of the person who built the thing.
Why It Matters Here (In the Capstone)
Everything in Semester 10 converges here. You have designed (M1), built (M2), deployed (M3), and now must certify that what you deployed is operable and defensible. The PRR is how you certify it. The portfolio module (M5) points back to a signed PRR as evidence.
A capstone without a PRR is a demo. A capstone with one is a system.
Concrete Example -- the Checklist
File at library/raw/prr.md. Walk this line-by-line before declaring the capstone ready. 18 items, grouped by cluster.
SLOs and Error Budgets (Cluster 1)
- 1. SLI defined -- one SLI is written as good / total over a specific event source. (Concept 1)
- 2. SLO target and window -- target and rolling window documented; target is defensible given actual traffic. (Concept 1)
- 3. Error budget + policy -- budget computed in events, 5-tier policy ladder in library/raw/error-budget-policy.md. (Concept 2)
- 4. Burn-rate alerts wired -- at least one fast-burn (page) and one slow-burn (ticket) multi-window alert. (Concept 3)
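As a quick sanity check on items 3 and 4, the arithmetic behind "budget computed in events" and a multi-window burn-rate alert can be sketched as below. This is a minimal illustration, not the module's reference implementation: the function names are hypothetical, and any threshold you pass in (for example the 14.4 used in some published fast-burn examples) is an assumption to be replaced by the values in your own alert policy.

```python
def error_budget_events(slo_target: float, total_events: int) -> int:
    """Bad events the SLO tolerates in the window: (1 - target) * total."""
    # round() absorbs floating-point noise such as 1000.0000000000009.
    return round((1.0 - slo_target) * total_events)


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if total == 0:
        return 0.0  # no traffic, so nothing was burned
    return (bad / total) / (1.0 - slo_target)


def multi_window_alert(long_rate: float, short_rate: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening right now.
    """
    return long_rate >= threshold and short_rate >= threshold


# A 99.9% target over 1,000,000 events gives a budget of 1000 bad events.
budget = error_budget_events(0.999, 1_000_000)
```

Wiring one call with a high threshold to a page and another with a low threshold to a ticket gives the fast-burn / slow-burn pair that item 4 asks for.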
Observability in Practice (Cluster 2)
- 5. Structured logs at decision boundaries -- event naming schema in library/raw/logging.md, fields stable across services. (Concept 4)
- 6. Three-question dashboard -- single dashboard answers "healthy? slow? failing whom?" in < 10s. (Concept 5)
- 7. Critical-path trace -- one full distributed trace of the SLO path, sampling policy documented, errors always retained. (Concept 6)
Threat Model (Cluster 3)
- 8. STRIDE worksheet -- DFD + STRIDE table for at least the top trust-boundary flow; at least one letter walked all the way to a deployed mitigation. (Concept 7)
- 9. Secrets policy enforced -- secrets in a manager (not in .env); rotation cadence documented; pre-commit + CI leak scanning live. (Concept 8)
- 10. Dependency scanning live -- CI fails on HIGH/CRITICAL CVEs; pinned versions with hashes. (Concept 8)
- 11. Supply chain at SLSA L2 -- signed provenance on build; attestation verified at deploy. (Concept 8)
- 12. Least privilege verified -- every runtime and CI identity reviewed; at least one role tightened until it broke and widened just enough; library/raw/iam.md up to date. (Concept 9)
Failure Planning (Cluster 4)
- 13. Top-three failures named -- library/raw/top-failures.md lists three likely-and-impactful failures with first-check guidance. (Concept 10)
- 14. Retry / breaker / degraded-mode decisions -- library/raw/reliability-decisions.md per external dependency; tested in staging. (Concept 11)
- 15. Backup + restore drill -- one restore executed end to end within the last 30 days; log in library/raw/recovery.md; RPO / RTO documented. (Concept 12)
Runbooks and On-Call (Cluster 5)
- 16. Three runbooks -- one per top failure, using the five-section template; peer-read and confirmed followable. (Concept 13)
- 17. On-call posture documented -- coverage hours, page-vs-ticket rules, kill switch, fallback person. (Concept 14)
- 18. PRR signed -- you have read every item above, every box is green or yellow-with-reason, and you sign and date this file. (this concept)
At the bottom of library/raw/prr.md:
PRR signature
-------------
Signed: <your name>
Date: YYYY-MM-DD
Next review: YYYY-MM-DD (+90d)
Outstanding risks (yellow items):
- <item>: <reason accepted> -- revisit by <date>
Evidence discipline. Every green must link to an artifact: the SLO doc, the Grafana dashboard URL, the runbook, the CI log showing the vuln-scan gate, the Terraform plan showing least-privilege policy, the restore log. A green without evidence is a yellow masquerading as a green.
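The evidence rule, together with the requirement that every yellow carries a justification and a date, is mechanical enough to check in a few lines. The sketch below assumes you keep the checklist as a list of records; the PrrItem fields and the audit function are hypothetical names chosen for illustration, not a format this module prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrrItem:
    number: int
    status: str           # "green", "yellow", or "red"
    evidence: str = ""    # link or path to the artifact that proves a green
    reason: str = ""      # one-line justification for a yellow
    revisit_by: str = ""  # date by which a yellow will be retired


def audit(items: List[PrrItem]) -> List[str]:
    """Return one finding per item that breaks the evidence discipline."""
    findings = []
    for item in items:
        # A green needs a non-empty evidence link.
        if item.status == "green" and not item.evidence.strip():
            findings.append(f"item {item.number}: green without evidence")
        # A yellow needs both a reason and a revisit date.
        if item.status == "yellow" and not (
            item.reason.strip() and item.revisit_by.strip()
        ):
            findings.append(f"item {item.number}: yellow missing reason or date")
    return findings
```

An empty result does not prove the artifacts are current or correct; it only shows that no box was left unsupported, which is the minimum a reviewer will check first.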
Operating cadence after signature. PRR is not a one-time event. Set a calendar reminder 90 days out. Any material architectural change -- new dependency, new data store, new identity, change in traffic profile -- triggers an early re-review of the affected rows. Between reviews, any incident updates the relevant runbook and may reopen rows 13-16.
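The cadence described here reduces to a date calculation plus an override for material changes. A minimal sketch using only the Python standard library follows; the function names and the material_change flag are illustrative assumptions.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # the +90d from the signature block


def next_review(signed: date) -> date:
    """Date to write on the 'Next review' line of the signature block."""
    return signed + REVIEW_INTERVAL


def review_due(signed: date, today: date, material_change: bool = False) -> bool:
    """True when the interval has elapsed or a material change occurred.

    A material change (new dependency, new data store, new identity,
    changed traffic profile) triggers an early re-review of affected rows.
    """
    return material_change or today >= next_review(signed)
```

For example, a PRR signed on 2025-01-01 has its next review on 2025-04-01, and adding a new data store in February makes review_due return True immediately.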
Common Confusion / Misconceptions
"A PRR is a one-time event." It is a gate and a recurring review. Schedule the next PRR review 90 days out. Any significant architectural change triggers an early re-review.
"I'll mark items green to pass my own review." The PRR is only as honest as the operator. Under-reporting now becomes an incident later. If a senior engineer reads your PRR and the artifacts do not support the green marks, trust is lost faster than with a few honest yellows.
"PRRs are for giant services." The format scales down cleanly. The 18-item version above is the capstone scale; Google's internal PRRs have dozens to hundreds of items per service. The discipline is the same: named items, evidence, a signature.
"The PRR is documentation." It is a gate. Passing it is a decision ("this system is operable enough for production"). Documentation describes how the system works; the PRR asserts that the system is ready to be operated.
"Yellows are fine; nobody reads them." The reviewer does. Three yellows with dates are defensible. Eight yellows is a system that is not ready. The count of yellows is itself a signal.
How To Use It (In Your Capstone)
- Copy the 18-item checklist into library/raw/prr.md.
- For each item, attach a link to the artifact that proves it (the SLO doc, the runbook, the restore log, the IAM policy diff).
- Walk the list in order. Mark green only if the artifact exists and is current.
- For yellows, write the one-line justification and an expected-by date.
- Sign and date. Put the next review in your calendar.
- When anything material changes (schema, dependency, role widening), re-walk the affected items.
- Present the signed PRR as the primary artifact of Module 4 in the portfolio module (S10 M5).
See also (integrative)
- S10 M01 Domain Analysis & Architecture Design -- the architectural decisions captured in ADRs are what the PRR now certifies as operable, not just designed: ../../../module-01-domain-analysis-architecture-design/.
- S10 M03 Cloud Deployment & CI/CD -- deploy gates and IaC are the substrate the PRR validates; an item like "supply chain at SLSA L2" only makes sense in that substrate: ../../../module-03-cloud-deployment-ci-cd/.
- S10 M05 Portfolio & Specialization Assessment -- the signed PRR is the headline artifact of Module 4 in the portfolio: ../../../module-05-portfolio-specialization-assessment/.
- S8 M04 Cluster 5 -- incident lifecycle: the operational lifecycle the PRR asserts you are ready for: ../../../../semester-08-system-design-leadership/module-04-scale-reliability-performance/concepts/cluster-05-incident-and-observability/14-incident-lifecycle-detect-triage-mitigate-resolve-review-primary.md.
- S9 M05 Cluster 5 -- operating under observation: the pillar of operating artefacts the PRR's Cluster 1 and 2 rows reference: ../../../../semester-09-cloud-devops/module-05-cloud-security-observability/concepts/cluster-05-operating-under-observation/13-dashboards-that-answer-questions-primary.md.
- Google SRE Book -- Production Readiness Review / Engagement Model -- the original PRR framing, the early-engagement model, and the shift from checklist to continuous review.
- Google SRE Book -- Launch Coordination Engineering -- the other side of the same coin: how readiness is built into the launch process, not bolted on at the end.
- AWS Well-Architected -- Operational Excellence pillar -- a parallel industry checklist whose rows map closely onto the 18 items above.
Check Yourself
- Why must a "yellow" item carry both a justification and a date?
- Which two PRR items, if marked green without evidence, are the most likely to bite you during a real incident?
- Why does marking the PRR green for "least privilege verified" require a diff, not just a policy file?
- What is the right response if, during the review, you find yourself with more than three yellows?
- What cadence re-triggers the PRR, and what events re-open it early?
- How does the signed PRR feed into the portfolio module (S10 M5), and what would a portfolio reviewer look at first?
Mini Drill or Application (Capstone-scoped)
This is the headline drill of the module. In 2-3 hours:
- Copy the 18 items above into library/raw/prr.md with your artifact links.
- Walk the list. Mark honestly: green / yellow (with justification and date) / red.
- For every red, either fix it now or convert it to a yellow with an explicit accepted-risk rationale (max three yellows allowed; more than that means you are not ready).
- Sign and date. Set the next review for 90 days out.
- Commit the signed PRR. Link it from the portfolio module's hand-off artifact.
- Hand the PRR (and only the PRR) to a peer. Ask them to pick one green item at random and challenge you to produce the evidence in under 2 minutes. Any item that fails that test was not actually green.
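The go / no-go rule in the drill (no reds left standing, at most three yellows) can be written down as a small function so the verdict does not depend on mood. This is a sketch under the assumption that you record each item's status as a plain string; the names verdict and MAX_YELLOWS are hypothetical.

```python
from collections import Counter
from typing import Iterable

MAX_YELLOWS = 3  # the drill's limit; more than this means "not ready"
VALID = {"green", "yellow", "red"}


def verdict(statuses: Iterable[str]) -> str:
    """Summarise a walked checklist as a single launch decision."""
    counts = Counter(statuses)
    unknown = set(counts) - VALID
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    if counts["red"] > 0:
        # Every red must be fixed or explicitly converted to a yellow first.
        return "blocked"
    if counts["yellow"] > MAX_YELLOWS:
        return "not ready"
    return "ready to sign"
```

Fifteen greens and three yellows yield "ready to sign"; a fourth yellow flips the result to "not ready", and any remaining red yields "blocked" regardless of the other counts.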
Source Backbone
Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.
- Building Secure and Reliable Systems - primary security and reliability backbone.
- Software Engineering at Google - operational process and engineering discipline.
- Designing Distributed Systems - service and reliability pattern support.