Module 4: Operational Readiness & Security Review

Core references: the Google SRE Book and SRE Workbook, the OpenTelemetry docs, the OWASP threat-modeling body of knowledge, and SLSA for supply-chain framing -- no new required books. Selective support: Building Secure and Reliable Systems and the S8 / S9 material you have already worked through.

This guide is the primary teacher. You do not need to reread the SRE books front-to-back. You do need to leave this module able to defend your capstone under operational and security scrutiny: point to an SLO you actually measure, a dashboard that answers real questions, a threat you have actually mitigated, and a runbook you would be willing to use at 3 a.m.

Scope of This Module

This module is the production-readiness review for your capstone. It is deliberately small, concrete, and judgmental. "We will add observability later" is not a valid answer; this module is where "later" ends.

What it covers in depth:

writing one real SLI and one real SLO for your capstone, including the measurement window and the consequence of missing it
an error budget at capstone scale -- small but real -- and what it should force you to stop doing
alert hygiene: alerting on the SLO and symptoms, not on every graph that moves
structured logging placed where it matters, not everywhere by default
a three-question dashboard tied to user-visible behavior
one end-to-end trace through the critical path of your system
a STRIDE pass against your capstone, with at least one letter walked all the way to a mitigation
secrets, dependency, and supply-chain hygiene at the level a solo developer can actually maintain
least privilege on the cloud accounts, CI, and runtime roles you are actually using
the three most likely failures for your capstone and the specific mitigations you will ship
retry, circuit-breaker, and graceful-degradation decisions written down with their trade-offs
backup and recovery drilled at least once, end to end
a runbook for the top three incidents with symptoms, checks, mitigations, and escalation
on-call hygiene when the "on-call team" is a single operator
a Production Readiness Review checklist you sign before Semester 10 closes

What it deliberately does not try to finish here:

a full observability platform migration
a formal security audit with an external auditor
enterprise-grade compliance programs (SOC 2, ISO 27001)
a 12-person rotation or follow-the-sun on-call

This is a capstone-integrative module. It pulls from S6 distributed systems, S8 reliability and performance, and S9 cloud security and observability -- and asks you to cash those concepts in for a concrete, defensible system.

Before You Start

Answer these closed-book before starting the main path:

What is the difference between an SLI, an SLO, and an SLA, and which one does your capstone actually have today?
Why is "CPU is at 95%" usually a bad alert on its own?
What does STRIDE stand for, and which letter does your capstone most likely fail?
If your database disappeared right now, what would you lose and how would you know?
If you were paged at 2 a.m. for your capstone, which single document would you want to open first?

Diagnostic Interpretation

Solid answers	Interpretation
4-5	You are ready for the full path.
2-3	Continue, but plan extra time in Cluster 3 (threat model) and Cluster 4 (failure planning).
0-1	Revisit S8 M04 Scale/Reliability/Performance and S9 M05 Cloud Security & Observability before starting. This module is a compounding review, not a first encounter.

What This Module Is For

By the end of this module, a senior engineer should be able to interrogate your capstone for thirty minutes and leave convinced that:

you know what "working" means in measurable terms
you know which failures are likely and which you have actually prepared for
you know who can reach what, why, and how to revoke it
you know how you would find out if things were broken, and what you would do next

That is the bar. Everything else in this module supports it.

Concept Map

How To Use This Module

Work the clusters in order. Each one assumes the previous one is on disk, not just in your head.

Cluster 1: SLOs and Error Budgets

Order	Concept	Type	Focus
1	Writing One Real SLI and SLO for Your Capstone	PRIMARY	One measurable indicator, one target, a window, a consequence
2	Error Budget for a Capstone: Small but Real	PRIMARY	A budget you can actually burn, with a policy attached
3	Alert on the SLO, Not Everything	PRIMARY	Page on user-visible symptoms and burn rate, not on noise

Cluster mastery check: Can you show one SLI formula, its 30-day target, the consequence of missing it, and the alert attached to its burn rate?
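
The arithmetic behind this check is small enough to sketch. The following is a minimal illustration, not a prescribed implementation: it assumes a request-based SLI (good events over total events), and every name in it (`Slo`, `sli`, `error_budget_events`, `budget_consumed`, `burn_rate`, `should_page`) is hypothetical. The 14.4 default is the fast-burn figure that follows from spending 2% of a 30-day budget in one hour; treat it as a starting point to tune, not a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    target: float     # e.g. 0.99 means 99% of events must be good
    window_days: int  # rolling measurement window, e.g. 30


def sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def error_budget_events(slo: Slo, total_events: int) -> float:
    """How many bad events the SLO tolerates for this many total events."""
    return (1.0 - slo.target) * total_events


def budget_consumed(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget already spent (1.0 means exhausted)."""
    allowed = error_budget_events(slo, total_events)
    bad = total_events - good_events
    if allowed == 0:
        return 0.0 if bad == 0 else float("inf")
    return bad / allowed


def burn_rate(slo: Slo, good_events: int, total_events: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A burn rate of 1.0 spends the budget exactly over the full window.
    """
    return (1.0 - sli(good_events, total_events)) / (1.0 - slo.target)


def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast, so brief blips do not page.

    14.4 = 0.02 * 720 h / 1 h: 2% of a 30-day budget spent in one hour.
    """
    return long_window_burn >= threshold and short_window_burn >= threshold
```

Feed `burn_rate` with counts from two windows (for example one hour and five minutes) and pass both results to `should_page`. Requiring both windows to exceed the threshold is what keeps the alert tied to sustained, user-visible failure rather than a single bad minute.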

Cluster 2: Observability in Practice

Order	Concept	Type	Focus
4	Adding Structured Logs Where They Matter	PRIMARY	Log with fields, not sentences, at decision boundaries
5	A Dashboard That Answers 3 Specific Questions	PRIMARY	Healthy? Slow? Failing whom? -- and nothing else
6	Tracing the Critical Path End-to-End	PRIMARY	One user request, one trace, every hop attributed

Cluster mastery check: Given an unhappy user report, can you reach the suspect span in under two minutes using only your dashboard and traces?
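
"Fields, not sentences" is easier to see in code. The sketch below uses only the Python standard library and assumes one JSON object per log line; the helper names (`JsonFormatter`, `build_logger`, `log_decision`) and the field names in the usage note are hypothetical, chosen for illustration rather than taken from any particular capstone.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable field names."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields attached by log_decision via `extra`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep output out of the root logger
    return logger


def log_decision(logger: logging.Logger, event: str, **fields) -> None:
    """Log a named event at a decision boundary with key/value fields."""
    logger.info(event, extra={"fields": fields})
```

A call such as `log_decision(logger, "payment_declined", request_id="r-1", reason="card_expired")` emits a line you can filter and aggregate by `event`, `request_id`, or `reason`. Keeping the event name short and fixed, and putting everything variable into fields, is what makes the logs usable from a dashboard query.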

Cluster 3: Threat Model for the Capstone

Order	Concept	Type	Focus
7	STRIDE Applied to Your System	PRIMARY	One letter walked end-to-end to a concrete mitigation
8	Secrets, Dependencies, and Supply Chain	PRIMARY	Rotate, pin, scan, attest -- at solo-operator scale
9	Least Privilege in Practice -- Not Aspirationally	PRIMARY	Tighten one role until a real task breaks, then widen just enough

Cluster mastery check: Can you name the blast radius of each runtime identity and justify its current scope with a reason, not a default?
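
One way to ground that justification is to compare what a role is granted with what it has actually been observed using. The sketch below assumes you can export both as plain lists of action strings; the function names and the `service:action` strings in the usage note are hypothetical and do not correspond to any specific cloud provider's API.

```python
from __future__ import annotations

from typing import Iterable


def unused_grants(granted: Iterable[str], observed: Iterable[str]) -> set[str]:
    """Permissions the role holds but was never seen using: candidates to remove."""
    return set(granted) - set(observed)


def missing_grants(granted: Iterable[str], observed: Iterable[str]) -> set[str]:
    """Actions that were attempted without a grant: the task that broke."""
    return set(observed) - set(granted)


def wildcard_grants(granted: Iterable[str]) -> list[str]:
    """Grants containing '*', which usually need a written reason to stay."""
    return sorted(action for action in granted if "*" in action)
```

For example, with `granted = ["storage:read", "storage:write", "queue:*"]` and `observed = ["storage:read", "queue:send"]`, `unused_grants` flags `storage:write` and `queue:*` for review, and `wildcard_grants` singles out `queue:*`. Exact string comparison does not expand wildcards, so `missing_grants` also reports `queue:send` even though `queue:*` may cover it; check wildcard coverage by hand before concluding a grant is missing.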

Cluster 4: Failure Planning

Order	Concept	Type	Focus
10	Identifying Your Capstone's Three Most Likely Failures	PRIMARY	Likelihood times impact, not imagination times fear
11	Mitigations: Retry, Circuit Breaker, Degraded Mode	PRIMARY	Three tools, three trade-offs, one explicit decision per dependency
12	Backup and Recovery: the Forgotten Basics	PRIMARY	Untested backups are rumors. Restore once, time it, write it down

Cluster mastery check: For each of your three likely failures, can you point to an implemented mitigation and an operator action?
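
The three mitigation tools from concept 11 compose in a specific order: the breaker wraps the dependency call, the retry wraps the breaker, and degraded mode wraps both. The following is a minimal single-threaded sketch under stated assumptions, not a production library: all names (`CircuitBreaker`, `CircuitOpenError`, `retry_with_backoff`, `call_with_degraded_mode`) and default thresholds are hypothetical, and the `clock` and `sleep` parameters exist only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call to protect the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency circuit is open")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(fn: Callable[[], T], attempts: int = 3,
                       base_delay_s: float = 0.1, max_delay_s: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff and full jitter; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # retrying into an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise RuntimeError("attempts must be at least 1")


def call_with_degraded_mode(fn: Callable[[], T], breaker: CircuitBreaker, fallback: T,
                            sleep: Callable[[float], None] = time.sleep) -> T:
    """Return the live result if possible, otherwise a degraded fallback."""
    try:
        return retry_with_backoff(lambda: breaker.call(fn), sleep=sleep)
    except Exception:
        return fallback
```

The trade-offs to write down per dependency are visible in the parameters: more `attempts` hides transient faults but adds latency and load, a lower `failure_threshold` protects the dependency sooner but trips on noise, and the `fallback` is only acceptable if you have decided what stale or partial data the user may safely see.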

Cluster 5: Runbooks and On-Call

Order	Concept	Type	Focus
13	Writing a Runbook for the Top 3 Incidents	PRIMARY	A 1-page template: symptoms, impact, checks, mitigations, escalation
14	On-Call Hygiene for a Solo Operator	SUPPORTING	Paging, escalation, and sustainability when the team is you
15	Production Readiness Review -- a Capstone Checklist	PRIMARY	The 12-20 item gate before you call the capstone "production ready"

Cluster mastery check: Can you sign a PRR for your capstone today, and if not, name the exact items that would need to be green first?

Then work these practice pages:

Order	Practice path	Focus
1	SLO and Alert Lab	Define an SLI, an SLO, an error-budget policy, and one burn-rate alert
2	Observability Instrumentation Workshop	Logs, dashboards, and one trace through the critical path
3	Threat Model and Security Clinic	STRIDE, secrets, supply chain, least privilege
4	Operational Katas	SLI+alert, trace, STRIDE, runbook -- repeated until fluent

Take the Module Quiz after you finish the concept and practice paths. Use Reference and Learning Resources for targeted reinforcement only.

Learning Objectives

By the end of this module you should be able to:

Write one SLI as a formal expression over success and total events, with a target, a window, and a documented consequence for missing it.
Compute an error budget for a given SLO and state what it should force you to stop doing when half consumed.
Construct a burn-rate alert that pages on user-visible symptoms rather than low-level resource metrics.
Place structured logs at decision boundaries with stable field names suitable for aggregation.
Build a dashboard that answers exactly three questions about your capstone's current state.
Instrument the critical path of your capstone with a distributed trace whose spans cover every significant hop.
Run a STRIDE pass on your system and walk at least one threat to an implemented mitigation.
Describe your secrets lifecycle, dependency policy, and supply-chain posture in one page.
Tighten at least one IAM or application role until a real task breaks, then widen just enough to unblock it.
Name the three most likely failures of your capstone with likelihood, impact, and a mitigation for each.
Draft a retry, circuit-breaker, and graceful-degradation decision per external dependency with its trade-offs.
Execute and document one real backup-and-restore drill for your capstone's primary data store.
Write a runbook for the top three incidents using the symptoms / impact / checks / mitigations / escalation template.
Describe a sustainable on-call posture for a solo operator, including what not to page on.
Fill out a Production Readiness Review checklist and defend every item you marked green.

Outputs

one SLO document for your capstone: SLI, target, window, error budget, consequence, and a burn-rate alert
one dashboard screenshot that answers the three named questions
one distributed trace of a user-visible critical path, annotated with span-level latencies
one completed STRIDE worksheet with at least one threat walked to a deployed mitigation
one "secrets and supply chain" one-pager: rotation, scanning, pinning, attestation
one least-privilege diff: before and after role policy, with the breakage that forced the widening
one "three most likely failures" memo with likelihood, impact, and mitigation per failure
one backup-and-restore log showing an end-to-end drill with measured RTO and RPO
three runbooks (one per top incident) using the standard 1-page template
one completed Production Readiness Review checklist, signed and dated

Completion Standard

You have completed Module 4 when all of these are true:

your capstone has at least one SLO measured by real data, not aspiration
your alerts fire on user-visible symptoms tied to that SLO
your dashboard answers the three questions you would want answered at 2 a.m.
a trace exists for your critical path and you know how to open it
STRIDE has been run, and at least one threat has a deployed mitigation you can point to
secrets, dependencies, and supply chain each have a stated policy you actually follow
at least one runtime role has been tightened until a task broke, then widened just enough
your three likely failures each have a mitigation and an operator action
a backup restore has been drilled end to end with a measured time
three runbooks exist and a trusted peer says they could follow them cold
you have signed a PRR for the capstone and could defend every item

If any of those is "we will get to it," the module is not complete.

Reading Policy

Concept pages are the main path.
The "See also (integrative)" section at the end of each concept page points to 1-2 prior-semester modules and one validated external source. It is not a reading list.
Prefer the SRE book / workbook, OpenTelemetry docs, OWASP, and SLSA for ground truth. Treat blogs as commentary.
If you find yourself reading instead of writing down an SLO, a runbook, or a threat, stop reading.

Suggested Weekly Flow (Week 94)

Day	Work
1	Concepts 1-3 and draft one SLI+SLO+alert for the capstone
2	Concepts 4-6 and instrument one trace plus the three-question dashboard
3	Concepts 7-9 and complete a STRIDE pass with at least one mitigation landed
4	Concepts 10-12 and drill one backup restore end to end
5	Concepts 13-15 and write the three runbooks
6	Practice pages 1-3, fill the PRR checklist, and fix the top three red items
7	Practice 4 (katas), quiz, and sign the PRR

Reference

If you need a concept-to-source map or escalation links grouped by cluster, use Reference.

Rich Learning Pages
