Production Readiness Review -- a Capstone Checklist

What This Concept Is

A Production Readiness Review (PRR) is a structured gate: before a service is treated as "in production," a checklist is walked item by item, and each item is marked green (done), yellow (accepted risk, documented), or red (blocker). At Google's scale, where the practice originated, the PRR is run by an SRE team on a launching service; at capstone scale it is run by you, on yourself, honestly.

This concept contains the actual checklist you will walk. It distills every preceding concept in this module into a single sheet. The PRR is the artifact a senior engineer or examiner looks at first -- if it is defensible, the rest of the defense becomes tractable.

The PRR is not a rubber stamp. Any "yellow" requires a one-line written justification and a date by which the risk will be retired; any "red" either blocks launch or is converted to a yellow, that is, an accepted launch risk with a written rationale and a dated follow-up. Rigour is what distinguishes a PRR from an exercise in self-congratulation.

PRRs also have a reviewing character: in industry they are walked by someone other than the author. In a capstone you play both roles -- author and reviewer -- which means you must read the checklist with the skepticism a senior engineer would bring, not the optimism of the person who built the thing.

Why It Matters Here (In the Capstone)

Everything in Semester 10 converges here. You have designed (M1), built (M2), deployed (M3), and now must certify that what you deployed is operable and defensible. The PRR is how you certify it. The portfolio module (M5) points back to a signed PRR as evidence.

A capstone without a PRR is a demo. A capstone with one is a system.

Concrete Example -- the Checklist

The checklist lives at library/raw/prr.md. Walk it line by line before declaring the capstone ready. It has 18 items, grouped by cluster.

SLOs and Error Budgets (Cluster 1)

1. SLI defined -- one SLI is written as good / total over a specific event source. (Concept 1)
2. SLO target and window -- target and rolling window documented; target is defensible given actual traffic. (Concept 1)
3. Error budget + policy -- budget computed in events, 5-tier policy ladder in library/raw/error-budget-policy.md. (Concept 2)
4. Burn-rate alerts wired -- at least one fast-burn (page) and one slow-burn (ticket) multi-window alert. (Concept 3)
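
Items 3 and 4 reduce to a small amount of arithmetic. The Python sketch below computes an error budget in events and a burn rate, then applies a two-window check. The function names, the traffic figures, and the threshold are illustrative assumptions, not values taken from this module; substitute the numbers from your own SLO doc.

```python
def error_budget_events(slo_target: float, total_events: int) -> float:
    """Bad events the SLO tolerates over the window: (1 - target) * total."""
    return (1.0 - slo_target) * total_events


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    return (bad_events / total_events) / (1.0 - slo_target)


def multi_window_alert(long_rate: float, short_rate: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_rate > threshold and short_rate > threshold


# Hypothetical numbers: a 99% SLO over 100,000 requests in the window.
budget = error_budget_events(0.99, 100_000)   # about 1,000 bad requests
fast = multi_window_alert(
    long_rate=burn_rate(200, 1_000, 0.99),    # long window, about 20x
    short_rate=burn_rate(30, 100, 0.99),      # short window, about 30x
    threshold=10.0,                           # illustrative page threshold
)
```

Requiring both windows to exceed the threshold is what keeps a brief spike from paging: the long window shows the burn is significant, the short window shows it is still happening.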

Observability in Practice (Cluster 2)

5. Structured logs at decision boundaries -- event naming schema in library/raw/logging.md, fields stable across services. (Concept 4)
6. Three-question dashboard -- single dashboard answers "healthy? slow? failing whom?" in < 10s. (Concept 5)
7. Critical-path trace -- one full distributed trace of the SLO path, sampling policy documented, errors always retained. (Concept 6)
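
The "errors always retained" clause in item 7 can be written down as a sampling decision. This is a minimal sketch, assuming the decision is made after a trace completes (so its error status is known) and that trace IDs are strings. `keep_trace` and the sample rate are hypothetical names and values, not the API of any tracing library.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_rate: float) -> bool:
    """Keep every errored trace; keep a fixed fraction of the healthy ones."""
    if has_error:
        return True  # errors bypass sampling entirely
    # Hash the trace ID so every service makes the same decision for a trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < sample_rate
```

Hashing the ID rather than drawing a random number makes the decision reproducible, which is useful when the sampling policy itself has to be documented and defended.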

Threat Model (Cluster 3)

8. STRIDE worksheet -- DFD + STRIDE table for at least the top trust-boundary flow; at least one letter walked all the way to a deployed mitigation. (Concept 7)
9. Secrets policy enforced -- secrets in a manager (not in .env); rotation cadence documented; pre-commit + CI leak scanning live. (Concept 8)
10. Dependency scanning live -- CI fails on HIGH/CRITICAL CVEs; pinned versions with hashes. (Concept 8)
11. Supply chain at SLSA L2 -- signed provenance on build; attestation verified at deploy. (Concept 8)
12. Least privilege verified -- every runtime and CI identity reviewed; at least one role tightened until it broke and widened just enough; library/raw/iam.md up to date. (Concept 9)
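
For item 12, the evidence is the change in permissions, not the final policy alone. A minimal sketch of that diff, assuming each role can be flattened to a set of permission strings; the permission names below are placeholders, not identifiers from any real cloud provider.

```python
def permission_diff(before: set[str], after: set[str]) -> dict[str, list[str]]:
    """Return what a tightening pass removed from and added to a role."""
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
    }


# Hypothetical role before and after the tightening pass.
before = {"storage:read", "storage:write", "storage:delete", "queue:send"}
after = {"storage:read", "storage:write", "queue:send"}
diff = permission_diff(before, after)
```

A non-empty "removed" list is what shows the role was actually tightened; an empty diff means the review changed nothing.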

Failure Planning (Cluster 4)

13. Top-three failures named -- library/raw/top-failures.md lists three likely-and-impactful failures with first-check guidance. (Concept 10)
14. Retry / breaker / degraded-mode decisions -- library/raw/reliability-decisions.md per external dependency; tested in staging. (Concept 11)
15. Backup + restore drill -- one restore executed end to end within the last 30 days; log in library/raw/recovery.md; RPO / RTO documented. (Concept 12)
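
Item 15 has two checkable conditions: the drill is recent, and the measured restore fits the documented objectives. A minimal sketch, assuming you record the drill date, the restore duration, and the data-loss window in library/raw/recovery.md; the function and parameter names are hypothetical.

```python
from datetime import date, timedelta


def drill_is_current(last_drill: date, today: date, max_age_days: int = 30) -> bool:
    """True if the last restore drill ran within the allowed age (and not in the future)."""
    age = today - last_drill
    return timedelta(0) <= age <= timedelta(days=max_age_days)


def drill_meets_objectives(
    restore_minutes: float,
    data_loss_minutes: float,
    rto_minutes: float,
    rpo_minutes: float,
) -> bool:
    """True if measured restore time is within RTO and measured data loss within RPO."""
    return restore_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes
```

If the drill misses either objective, the honest outcome is to revise the documented RPO / RTO or fix the restore path, not to mark the item green.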

Runbooks and On-Call (Cluster 5)

16. Three runbooks -- one per top failure, using the five-section template; peer-read and confirmed followable. (Concept 13)
17. On-call posture documented -- coverage hours, page-vs-ticket rules, kill switch, fallback person. (Concept 14)
18. PRR signed -- you have read every item above, every box is green or yellow-with-reason, and you sign and date this file. (this concept)

At the bottom of library/raw/prr.md:

PRR signature
-------------
Signed: <your name>
Date:   YYYY-MM-DD
Next review: YYYY-MM-DD (+90d)

Outstanding risks (yellow items):
  - <item>: <reason accepted> -- revisit by <date>

Evidence discipline. Every green must link to an artifact: the SLO doc, the Grafana dashboard URL, the runbook, the CI log showing the vuln-scan gate, the Terraform plan showing least-privilege policy, the restore log. A green without evidence is a yellow masquerading as a green.
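
The evidence rule is mechanical enough to check before a human reviewer does. This is a sketch only, assuming each checklist row is transcribed into a small record; `PrrItem`, its fields, and `audit` are hypothetical names, and the example links are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrrItem:
    number: int
    status: str                          # "green", "yellow", or "red"
    evidence: Optional[str] = None       # link or path to the artifact
    justification: Optional[str] = None  # one-line reason for a yellow
    revisit_by: Optional[str] = None     # YYYY-MM-DD for a yellow


def audit(items: list[PrrItem]) -> list[str]:
    """Return one message per row that would not survive a skeptical review."""
    problems = []
    for item in items:
        if item.status == "green":
            if not item.evidence:
                problems.append(f"item {item.number}: green without evidence")
        elif item.status == "yellow":
            if not (item.justification and item.revisit_by):
                problems.append(f"item {item.number}: yellow needs a justification and a date")
        elif item.status == "red":
            problems.append(f"item {item.number}: red blocks signature")
        else:
            problems.append(f"item {item.number}: unknown status {item.status!r}")
    return problems


# Hypothetical rows: one well-formed green, one unsupported green, one bare yellow.
rows = [
    PrrItem(1, "green", evidence="docs/slo.md"),
    PrrItem(6, "green"),
    PrrItem(11, "yellow", justification="provenance not yet verified at deploy"),
]
findings = audit(rows)
```

The check only confirms that a link exists; whether the artifact is current and actually supports the green still needs a reader.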

Operating cadence after signature. PRR is not a one-time event. Set a calendar reminder 90 days out. Any material architectural change -- new dependency, new data store, new identity, change in traffic profile -- triggers an early re-review of the affected rows. Between reviews, any incident updates the relevant runbook and may reopen rows 13-16.
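
The cadence can be written as two small rules. A minimal sketch using only the 90-day interval and the incident rule stated above; the function names and the `material_change` flag are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)
INCIDENT_ROWS = range(13, 17)  # rows 13-16, which an incident may reopen


def next_review(signed: date) -> date:
    """Date for the 'Next review' line in the signature block."""
    return signed + REVIEW_INTERVAL


def review_due(signed: date, today: date, material_change: bool = False) -> bool:
    """A review is due on schedule, or early after a material architectural change."""
    return material_change or today >= next_review(signed)
```

For example, a PRR signed on 2025-01-15 gets a next review date of 2025-04-15.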

Common Confusion / Misconceptions

"A PRR is a one-time event." It is a gate and a recurring review. Schedule the next PRR review 90 days out. Any significant architectural change triggers an early re-review.

"I'll mark items green to pass my own review." The PRR is only as honest as the operator. Under-reporting now becomes an incident later. If a senior engineer reads your PRR and the artifacts do not support the green marks, trust is lost faster than with a few honest yellows.

"PRRs are for giant services." The format scales down cleanly. The 18-item version above is the capstone scale; Google's internal PRRs have dozens to hundreds of items per service. The discipline is the same: named items, evidence, a signature.

"The PRR is documentation." It is a gate. Passing it is a decision ("this system is operable enough for production"). Documentation describes how the system works; the PRR asserts that the system is ready to be operated.

"Yellows are fine; nobody reads them." The reviewer does. Three yellows with dates are defensible. Eight yellows is a system that is not ready. The count of yellows is itself a signal.

How To Use It (In Your Capstone)

Copy the 18-item checklist into library/raw/prr.md.
For each item, attach a link to the artifact that proves it (the SLO doc, the runbook, the restore log, the IAM policy diff).
Walk the list in order. Mark green only if the artifact exists and is current.
For yellows, write the one-line justification and an expected-by date.
Sign and date. Put the next review in your calendar.
When anything material changes (schema, dependency, role widening), re-walk the affected items.
Present the signed PRR as the primary artifact of Module 4 in the portfolio module (S10 M5).

Check Yourself

Why must a "yellow" item carry both a justification and a date?
Which two PRR items, if marked green without evidence, are the most likely to bite you during a real incident?
Why does marking the PRR green for "least privilege verified" require a diff, not just a policy file?
What is the right response if, during the review, you find yourself with more than three yellows?
What cadence re-triggers the PRR, and what events re-open it early?
How does the signed PRR feed into the portfolio module (S10 M5), and what would a portfolio reviewer look at first?

Mini Drill or Application (Capstone-scoped)

This is the headline drill of the module. In 2-3 hours:

Copy the 18 items above into library/raw/prr.md with your artifact links.
Walk the list. Mark honestly: green / yellow (with justification and date) / red.
For every red, either fix it now or convert it to a yellow with an explicit accepted-risk rationale (max three yellows allowed; more than that means you are not ready).
Sign and date. Set the next review for 90 days out.
Commit the signed PRR. Link it from the portfolio module's hand-off artifact.
Hand the PRR (and only the PRR) to a peer. Ask them to pick one green item at random and challenge you to produce the evidence in under 2 minutes. Any item that fails that test was not actually green.
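
Two steps of the drill can be sketched in code: the sign-off rule (no reds, at most three yellows) and the peer's random pick of a green item. This is an illustration under assumptions; `ready_to_sign`, `pick_challenge`, and the status dictionary keyed by item number are hypothetical, not part of the course materials.

```python
import random
from collections import Counter
from typing import Optional

MAX_YELLOWS = 3  # the cap stated in the drill


def ready_to_sign(statuses: dict[int, str]) -> bool:
    """True when no item is red and the yellow count is within the cap."""
    counts = Counter(statuses.values())
    return counts["red"] == 0 and counts["yellow"] <= MAX_YELLOWS


def pick_challenge(statuses: dict[int, str], rng: Optional[random.Random] = None) -> int:
    """Pick one green item number at random for the peer to challenge."""
    greens = sorted(n for n, s in statuses.items() if s == "green")
    if not greens:
        raise ValueError("no green items to challenge")
    return (rng or random).choice(greens)


# Hypothetical sheet: 18 items, one accepted risk.
sheet = {n: "green" for n in range(1, 19)}
sheet[11] = "yellow"
challenge = pick_challenge(sheet, random.Random(0))
```

Letting the peer (or a seeded generator) choose the item matters: if the author picks, the item with the strongest evidence gets chosen every time.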

Source Backbone

Capstone operations applies security, reliability, and distributed-systems material. These books are the source backbone for readiness review.

Building Secure and Reliable Systems -- primary security and reliability backbone.
Software Engineering at Google -- operational process and engineering discipline.
Designing Distributed Systems -- service and reliability pattern support.
