
Module 4: Operational Readiness & Security Review

Core references: the Google SRE Book and SRE Workbook, the OpenTelemetry docs, the OWASP threat-modeling body of knowledge, and SLSA for supply-chain framing -- no new required books. Selective support: Building Secure and Reliable Systems and the S8 / S9 material you have already worked through.

This guide is the primary teacher. You do not need to reread the SRE books front-to-back. You do need to leave this module able to defend your capstone under operational and security scrutiny: point to an SLO you actually measure, a dashboard that answers real questions, a threat you have actually mitigated, and a runbook you would be willing to use at 2 a.m.


Scope of This Module

This module is the production-readiness review for your capstone. It is deliberately small, concrete, and judgmental. "We will add observability later" is not a valid answer; this module is where "later" ends.

What it covers in depth:

  • writing one real SLI and one real SLO for your capstone, including the measurement window and the consequence of missing it
  • an error budget at capstone scale -- small but real -- and what it should force you to stop doing
  • alert hygiene: alerting on the SLO and symptoms, not on every graph that moves
  • structured logging placed where it matters, not everywhere by default
  • a three-question dashboard tied to user-visible behavior
  • one end-to-end trace through the critical path of your system
  • a STRIDE pass against your capstone, with at least one letter walked all the way to a mitigation
  • secrets, dependency, and supply-chain hygiene at the level a solo developer can actually maintain
  • least privilege on the cloud accounts, CI, and runtime roles you are actually using
  • the three most likely failures for your capstone and the specific mitigations you will ship
  • retry, circuit-breaker, and graceful-degradation decisions written down with their trade-offs
  • backup and recovery drilled at least once, end to end
  • a runbook for the top three incidents with symptoms, checks, mitigations, and escalation
  • on-call hygiene when the "on-call team" is a single operator
  • a Production Readiness Review checklist you sign before Semester 10 closes

What it deliberately does not try to finish here:

  • a full observability platform migration
  • a formal security audit with an external auditor
  • enterprise-grade compliance programs (SOC 2, ISO 27001)
  • a 12-person rotation or follow-the-sun on-call

This is a capstone-integrative module. It pulls from S6 distributed systems, S8 reliability and performance, and S9 cloud security and observability -- and asks you to cash those concepts in for a concrete, defensible system.


Before You Start

Answer these closed-book before starting the main path:

  1. What is the difference between an SLI, an SLO, and an SLA, and which one does your capstone actually have today?
  2. Why is "CPU is at 95%" usually a bad alert on its own?
  3. What does STRIDE stand for, and which letter does your capstone most likely fail?
  4. If your database disappeared right now, what would you lose and how would you know?
  5. If you were paged at 2 a.m. for your capstone, which single document would you want to open first?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but plan extra time in Cluster 3 (threat model) and Cluster 4 (failure planning).

0-1 solid answers

  • Revisit S8 M04 Scale/Reliability/Performance and S9 M05 Cloud Security & Observability before starting. This module is a compounding review, not a first encounter.

What This Module Is For

By the end of this module, a senior engineer should be able to interrogate your capstone for thirty minutes and leave convinced that:

  • you know what "working" means in measurable terms
  • you know which failures are likely and which you have actually prepared for
  • you know who can reach what, why, and how to revoke it
  • you know how you would find out if things were broken, and what you would do next

That is the bar. Everything else in this module supports it.


How To Use This Module

Work the clusters in order. Each one assumes the previous one is on disk, not just in your head.

Cluster 1: SLOs and Error Budgets

Order | Concept | Type | Focus
1 | Writing One Real SLI and SLO for Your Capstone | PRIMARY | One measurable indicator, one target, a window, a consequence
2 | Error Budget for a Capstone: Small but Real | PRIMARY | A budget you can actually burn, with a policy attached
3 | Alert on the SLO, Not Everything | PRIMARY | Page on user-visible symptoms and burn rate, not on noise

Cluster mastery check: Can you show one SLI formula, its 30-day target, the consequence of missing it, and the alert attached to its burn rate?
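
The mastery check above reduces to a few lines of arithmetic. The sketch below is a minimal illustration, not a prescribed implementation: the names `Slo`, `sli`, `burn_rate`, and `should_page`, the 99% target, and the event counts are all hypothetical. The 14.4 default is the one-hour paging threshold the SRE Workbook uses as an example for a 30-day window (it spends 2% of the budget in one hour); pick a threshold that fits your own capstone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target over a rolling window."""
    target: float      # e.g. 0.99 means 99% of events must be good
    window_days: int   # e.g. 30

    @property
    def error_budget(self) -> float:
        """Fraction of events allowed to be bad over the window."""
        return 1.0 - self.target


def sli(good_events: int, total_events: int) -> float:
    """SLI = good events / total events. Zero traffic counts as meeting the SLO."""
    return 1.0 if total_events == 0 else good_events / total_events


def burn_rate(good_events: int, total_events: int, slo: Slo) -> float:
    """Observed error rate divided by the budgeted error rate.

    1.0 spends the budget exactly over the window; 3.0 spends it in a third
    of the window. Assumes target < 1.0, so the budget is never zero.
    """
    return (1.0 - sli(good_events, total_events)) / slo.error_budget


def should_page(rate: float, threshold: float = 14.4) -> bool:
    """Page only when the budget is burning at or above the chosen threshold."""
    return rate >= threshold


if __name__ == "__main__":
    slo = Slo(target=0.99, window_days=30)
    # Assumed counts for the last hour: 9,700 good requests out of 10,000.
    rate = burn_rate(9_700, 10_000, slo)
    print(f"SLI={sli(9_700, 10_000):.3f} burn_rate={rate:.1f} page={should_page(rate)}")
    print(f"budget lasts about {slo.window_days / rate:.0f} days at this rate")
```

With these assumed numbers the burn rate is about 3, so the 30-day budget would be gone in roughly 10 days. That is worth a ticket and a look, but it is below the paging threshold used here.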

Cluster 2: Observability in Practice

Order | Concept | Type | Focus
4 | Adding Structured Logs Where They Matter | PRIMARY | Log with fields, not sentences, at decision boundaries
5 | A Dashboard That Answers 3 Specific Questions | PRIMARY | Healthy? Slow? Failing whom? -- and nothing else
6 | Tracing the Critical Path End-to-End | PRIMARY | One user request, one trace, every hop attributed

Cluster mastery check: Given an unhappy user report, can you reach the suspect span in under two minutes using only your dashboard and traces?
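
"Log with fields, not sentences, at decision boundaries" is easier to judge with an example in front of you. The sketch below uses only Python's standard `logging` and `json` modules. The formatter, the `authorize_payment` decision, and every field name (`request_id`, `user_id`, `approved`, and so on) are hypothetical stand-ins for whatever decision your capstone actually makes.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with stable field names."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged in unchanged.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "capstone", stream=None) -> logging.Logger:
    """Return a logger that writes one JSON line per event."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def authorize_payment(logger: logging.Logger, request_id: str, user_id: str,
                      amount_cents: int, limit_cents: int = 50_000) -> bool:
    """Hypothetical decision boundary: log the outcome once, with fields."""
    approved = amount_cents <= limit_cents
    logger.info(
        "payment_authorization_decided",  # stable event name, not a sentence
        extra={"fields": {
            "request_id": request_id,
            "user_id": user_id,
            "amount_cents": amount_cents,
            "limit_cents": limit_cents,
            "approved": approved,
        }},
    )
    return approved


if __name__ == "__main__":
    log = build_logger()
    authorize_payment(log, request_id="req-1", user_id="u-42", amount_cents=1_200)
```

The point of the stable event name and fixed keys is aggregation: you can count `approved=false` per `user_id` without parsing prose.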

Cluster 3: Threat Model for the Capstone

Order | Concept | Type | Focus
7 | STRIDE Applied to Your System | PRIMARY | One letter walked end-to-end to a concrete mitigation
8 | Secrets, Dependencies, and Supply Chain | PRIMARY | Rotate, pin, scan, attest -- at solo-operator scale
9 | Least Privilege in Practice -- Not Aspirationally | PRIMARY | Tighten one role until a real task breaks, then widen just enough

Cluster mastery check: Can you name the blast radius of each runtime identity and justify its current scope with a reason, not a default?
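
One way to keep the STRIDE pass honest is to hold the worksheet as data, so that "at least one letter walked to a mitigation" can be checked rather than asserted. The six categories below are the standard STRIDE expansion; the `Threat` record, the helper functions, and the sample entries are hypothetical and only sketch one possible shape for the worksheet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Stride(Enum):
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"


@dataclass
class Threat:
    """One worksheet row. All field names here are illustrative."""
    category: Stride
    component: str
    description: str
    mitigation: str = ""    # empty until you have decided what to do
    deployed: bool = False  # True only once the mitigation is live


def uncovered_categories(threats: List[Threat]) -> List[Stride]:
    """STRIDE letters that have no recorded threat yet."""
    seen = {t.category for t in threats}
    return [c for c in Stride if c not in seen]


def walked_to_mitigation(threats: List[Threat]) -> List[Threat]:
    """Threats that have a named mitigation that is actually deployed."""
    return [t for t in threats if t.mitigation and t.deployed]


if __name__ == "__main__":
    worksheet = [
        Threat(Stride.SPOOFING, "login API", "session token replayed from logs",
               mitigation="stop logging tokens; short token lifetime", deployed=True),
        Threat(Stride.DENIAL_OF_SERVICE, "upload endpoint", "unbounded file size"),
    ]
    print("not yet considered:", [c.name for c in uncovered_categories(worksheet)])
    print("walked to mitigation:", len(walked_to_mitigation(worksheet)))
```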

Cluster 4: Failure Planning

Order | Concept | Type | Focus
10 | Identifying Your Capstone's Three Most Likely Failures | PRIMARY | Likelihood times impact, not imagination times fear
11 | Mitigations: Retry, Circuit Breaker, Degraded Mode | PRIMARY | Three tools, three trade-offs, one explicit decision per dependency
12 | Backup and Recovery: the Forgotten Basics | PRIMARY | Untested backups are rumors. Restore once, time it, write it down

Cluster mastery check: For each of your three likely failures, can you point to an implemented mitigation and an operator action?
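
The three mitigation tools named in this cluster compose in a particular order: retry wraps the call, the circuit breaker decides whether the call is allowed at all, and degraded mode is what the caller gets when both give up. The sketch below is one minimal way to wire them together under stated assumptions: the thresholds, delays, class and function names, and the idea of returning a cached fallback value are all hypothetical, and a real capstone may reasonably choose different trade-offs per dependency.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency calls suspended")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry(fn: Callable[[], T], attempts: int = 3, base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff; never retry against an open breaker."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")


def with_fallback(breaker: CircuitBreaker, fetch: Callable[[], T], fallback: T,
                  sleep: Callable[[float], None] = time.sleep) -> T:
    """Degraded mode: serve the fallback when the dependency cannot be reached."""
    try:
        return call_with_retry(lambda: breaker.call(fetch), sleep=sleep)
    except Exception:
        return fallback
```

The trade-off worth writing down for each dependency is visible in the defaults: retries add latency and load to a struggling service, the breaker trades freshness for protection, and the fallback is only acceptable if a stale or empty answer is genuinely better for the user than an error.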

Cluster 5: Runbooks and On-Call

Order | Concept | Type | Focus
13 | Writing a Runbook for the Top 3 Incidents | PRIMARY | A 1-page template: symptoms, impact, checks, mitigations, escalation
14 | On-Call Hygiene for a Solo Operator | SUPPORTING | Paging, escalation, and sustainability when the team is you
15 | Production Readiness Review -- a Capstone Checklist | PRIMARY | The 12-20 item gate before you call the capstone "production ready"

Cluster mastery check: Can you sign a PRR for your capstone today, and if not, name the exact items that would need to be green first?

Then work these practice pages:

Order | Practice path | Focus
1 | SLO and Alert Lab | Define an SLI, an SLO, an error-budget policy, and one burn-rate alert
2 | Observability Instrumentation Workshop | Logs, dashboards, and one trace through the critical path
3 | Threat Model and Security Clinic | STRIDE, secrets, supply chain, least privilege
4 | Operational Katas | SLI+alert, trace, STRIDE, runbook -- repeated until fluent

Take the Module Quiz after you finish the concept and practice paths. Use Reference and Learning Resources for targeted reinforcement only.


Learning Objectives

By the end of this module you should be able to:

  1. Write one SLI as a formal expression over success and total events, with a target, a window, and a documented consequence for missing it.
  2. Compute an error budget for a given SLO and state what it should force you to stop doing when half consumed.
  3. Construct a burn-rate alert that pages on user-visible symptoms rather than low-level resource metrics.
  4. Place structured logs at decision boundaries with stable field names suitable for aggregation.
  5. Build a dashboard that answers exactly three questions about your capstone's current state.
  6. Instrument the critical path of your capstone with a distributed trace whose spans cover every significant hop.
  7. Run a STRIDE pass on your system and walk at least one threat to an implemented mitigation.
  8. Describe your secrets lifecycle, dependency policy, and supply-chain posture in one page.
  9. Tighten at least one IAM or application role until a real task breaks, then widen just enough to unblock it.
  10. Name the three most likely failures of your capstone with likelihood, impact, and a mitigation for each.
  11. Draft a retry, circuit-breaker, and graceful-degradation decision per external dependency with its trade-offs.
  12. Execute and document one real backup-and-restore drill for your capstone's primary data store.
  13. Write a runbook for the top three incidents using the symptoms / impact / checks / mitigations / escalation template.
  14. Describe a sustainable on-call posture for a solo operator, including what not to page on.
  15. Fill out a Production Readiness Review checklist and defend every item you marked green.

Outputs

  • one SLO document for your capstone: SLI, target, window, error budget, consequence, and a burn-rate alert
  • one dashboard screenshot that answers the three named questions
  • one distributed trace of a user-visible critical path, annotated with span-level latencies
  • one completed STRIDE worksheet with at least one threat walked to a deployed mitigation
  • one "secrets and supply chain" one-pager: rotation, scanning, pinning, attestation
  • one least-privilege diff: before and after role policy, with the breakage that forced the widening
  • one "three most likely failures" memo with likelihood, impact, and mitigation per failure
  • one backup-and-restore log showing an end-to-end drill with measured RTO and RPO
  • three runbooks (one per top incident) using the standard 1-page template
  • one completed Production Readiness Review checklist, signed and dated
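
For the backup-and-restore log, the two measured numbers fall out of three timestamps. The sketch below assumes you record when the last restorable backup was taken, when the (simulated) failure happened, and when the service was serving correct data again. The `RestoreDrill` record, its field names, and the sample objectives are hypothetical; the measured values are what you compare against the RPO and RTO you have set for the capstone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    """Timestamps captured during one end-to-end restore drill (illustrative)."""
    last_backup_at: datetime       # newest backup that actually restored
    failure_at: datetime           # simulated moment the data store was lost
    service_restored_at: datetime  # moment the app served correct data again

    @property
    def data_loss_window(self) -> timedelta:
        """Measured recovery point: writes in this window are gone."""
        return self.failure_at - self.last_backup_at

    @property
    def recovery_time(self) -> timedelta:
        """Measured recovery time: how long users were without the service."""
        return self.service_restored_at - self.failure_at


def meets_objectives(drill: RestoreDrill, rpo: timedelta, rto: timedelta) -> bool:
    """True when the measured values are within the stated objectives."""
    return drill.data_loss_window <= rpo and drill.recovery_time <= rto


if __name__ == "__main__":
    drill = RestoreDrill(
        last_backup_at=datetime(2024, 1, 10, 2, 0),
        failure_at=datetime(2024, 1, 10, 9, 30),
        service_restored_at=datetime(2024, 1, 10, 10, 15),
    )
    print("data loss window:", drill.data_loss_window)
    print("recovery time:", drill.recovery_time)
    print("within objectives:",
          meets_objectives(drill, rpo=timedelta(hours=24), rto=timedelta(hours=1)))
```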

Completion Standard

You have completed Module 4 when all of these are true:

  • your capstone has at least one SLO measured by real data, not aspiration
  • your alerts fire on user-visible symptoms tied to that SLO
  • your dashboard answers the three questions you would want answered at 2 a.m.
  • a trace exists for your critical path and you know how to open it
  • STRIDE has been run, and at least one threat has a deployed mitigation you can point to
  • secrets, dependencies, and supply chain each have a stated policy you actually follow
  • at least one runtime role has been tightened until a task broke, then widened just enough
  • your three likely failures each have a mitigation and an operator action
  • a backup restore has been drilled end to end with a measured time
  • three runbooks exist and a trusted peer says they could follow them cold
  • you have signed a PRR for the capstone and could defend every item

If any of those is "we will get to it," the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • The "See also (integrative)" section at the end of each concept page points to 1-2 prior-semester modules and one validated external source. It is not a reading list.
  • Prefer the SRE book / workbook, OpenTelemetry docs, OWASP, and SLSA for ground truth. Treat blogs as commentary.
  • If you find yourself reading instead of writing down an SLO, a runbook, or a threat, stop reading.

Suggested Weekly Flow (Week 94)

Day | Work
1 | Concepts 1-3 and draft one SLI+SLO+alert for the capstone
2 | Concepts 4-6 and instrument one trace plus the three-question dashboard
3 | Concepts 7-9 and complete a STRIDE pass with at least one mitigation landed
4 | Concepts 10-12 and drill one backup restore end to end
5 | Concepts 13-15 and write the three runbooks
6 | Practice pages 1-3, fill the PRR checklist, and fix the top three red items
7 | Practice 4 (katas), quiz, and sign the PRR

Reference

If you need a concept-to-source map or escalation links grouped by cluster, use Reference.


Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread