External Exercises
This module has no LeetCode-style problems; the exercise is your own capstone. These lanes point to external reading-and-doing sets that build specific fluency when the concept pages are not enough. Work each lane against your own system, not a toy.
How To Use This Page
- Finish the relevant concept page and the matching practice page first.
- Pick a lane whose output you are still uncomfortable producing from scratch.
- Do the lane with your capstone repo open. The deliverable is a commit to your capstone, not notes.
- Maintain a mistake log with tags such as wrong SLI granularity, alert noise, unstructured log, missing trace hop, STRIDE gap missed, over-permissive role, untested backup, runbook missing rollback.
Lane 1: SLOs, Error Budgets, and Alerts
Use this lane when your SLO is aspirational or your alerts are noisy.
- Google SRE Book -- Service Level Objectives (skim chapter 4)
- Google SRE Workbook -- Implementing SLOs (work through "How to use error budgets")
- Google SRE Book -- Practical Alerting (read through "Alerting at the right level")
Target outcomes:
- one committed library/raw/slo.md
- one committed library/raw/error-budget-policy.md
- at least one multi-window burn-rate alert live in your monitoring tool
- a list of at least three non-SLO page-level alerts you have demoted to ticket or deleted
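The multi-window burn-rate condition in the target outcomes can be sketched in a few lines before you translate it into your monitoring tool's rule language. This is a minimal sketch, not a reference implementation: the `Window` type and both function names are hypothetical, and it assumes you can obtain total and failed request counts for a long and a short lookback window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if window.total == 0:
        return 0.0  # no traffic, so no budget is being spent
    error_budget = 1.0 - slo_target
    return (window.errors / window.total) / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float) -> bool:
    """Fire only when BOTH windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

For example, `should_page(Window(100_000, 2_000), Window(1_000, 20), slo_target=0.999, threshold=14.4)` pages, because both windows burn at roughly 20 times the sustainable rate. The 14.4 threshold with a 1-hour and 5-minute window pair follows the example in the SRE Workbook's chapter on alerting on SLOs; treat it as a starting point and tune it against your own traffic.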
Lane 2: Observability
Use this lane when you cannot reach the suspect span from an alert in under two minutes.
- OpenTelemetry -- Concepts (signals, context propagation, sampling, semantic conventions)
- Google SRE Book -- Monitoring Distributed Systems (four golden signals and white-box vs black-box)
Target outcomes:
- library/raw/logging.md with a named field schema
- one commit that replaces at least five string logs with structured events
- a capstone-live dashboard with three labeled rows answering the three questions
- one real distributed trace stored and linkable by URL
- library/raw/tracing.md with sampling policy and runbook-linking convention
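Replacing a string log with a structured event, as the target outcomes ask, might look like the sketch below. It uses only Python's standard `logging` and `json` modules. The field names (`ts`, `level`, `event`, `trace_id`, `order_id`) are a hypothetical schema for illustration; substitute the schema you commit in library/raw/logging.md.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed field schema."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
            # Hypothetical schema fields; absent ones are emitted as null
            # so every line has the same shape.
            "trace_id": getattr(record, "trace_id", None),
            "order_id": getattr(record, "order_id", None),
        }
        return json.dumps(event)


def build_logger(name: str = "capstone") -> logging.Logger:
    """Return a logger that writes one JSON event per line to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `log.error(f"payment failed for order {oid}")` then becomes `log.error("payment_failed", extra={"trace_id": tid, "order_id": oid})`. Carrying the trace ID as its own field is what lets a log line link to the matching trace.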
Lane 3: Threat Model, Secrets, Supply Chain, Least Privilege
Use this lane when your security posture is "probably fine."
- OWASP -- Threat Modeling (four-question framework + STRIDE)
- SLSA (levels and requirements)
- Secret scanners: gitleaks and trufflehog
- Provider IAM docs for your cloud (AWS IAM User Guide, GCP IAM, Azure AD)
Target outcomes:
- one committed STRIDE worksheet with a full walk on one gap
- one committed library/raw/security-policy.md
- CI step that fails on HIGH/CRITICAL dependency CVEs
- at least one artifact carries signed build provenance
- one IAM role diff committed, with the breakage-and-widening log
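The CI step that fails on HIGH/CRITICAL dependency CVEs reduces to a severity filter plus a non-zero exit code. The sketch below assumes you have already normalized your scanner's report into a list of dictionaries with `id`, `package`, and `severity` keys. That shape is hypothetical and is not the output format of any particular scanner, so adapt the field access to whatever your tool emits.

```python
import sys
from typing import Dict, Iterable, List

BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(findings: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the findings severe enough to fail the build."""
    return [f for f in findings
            if str(f.get("severity", "")).upper() in BLOCKING]


def gate(findings: Iterable[Dict[str, str]]) -> int:
    """Report blocking findings and return an exit code (0 passes, 1 fails)."""
    blocked = blocking_findings(findings)
    for f in blocked:
        # Assumed keys from the hypothetical normalized report.
        print(f"{f.get('severity')}: {f.get('id')} in {f.get('package')}",
              file=sys.stderr)
    return 1 if blocked else 0
```

In a pipeline, the last line of the script would be something like `sys.exit(gate(findings))`, so the job fails whenever a blocking finding is present. Many scanners can apply a severity threshold themselves; a wrapper like this is mainly useful when you need a committed, reviewable exception list alongside the rule.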
Lane 4: Failure Planning, Backup, Runbooks
Use this lane when "what happens when X fails?" returns vague answers.
- Microsoft Azure Architecture Center -- Circuit Breaker
- Google SRE Workbook -- Incident Response
- provider backup / PITR docs for your data store
Target outcomes:
- library/raw/top-failures.md with three prioritized failures
- library/raw/reliability-decisions.md per external dependency
- library/raw/recovery.md with a dated restore-drill log
- three runbooks in library/raw/runbooks/* using the five-section template
- library/raw/on-call.md with coverage, page-vs-ticket rules, and a kill switch
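When you write the reliability decision for an external dependency, it helps to have the circuit breaker pattern from the reading in concrete form. The sketch below is a deliberately small, single-threaded version with three states (closed, open, half-open). The class name, parameters, and defaults are illustrative assumptions; a production breaker would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_after_s:
            return "half-open"  # cool-down elapsed; allow a probe call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency call rejected; breaker is open")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open probe, or too many consecutive failures
            # while closed, opens the breaker and restarts the cool-down.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(lambda: client.fetch(order_id))` (with `client.fetch` standing in for your own call) makes the failure mode explicit: callers see `CircuitOpenError` immediately instead of waiting on timeouts, which is the behavior a runbook can then describe.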
Self-Curated Problem Set
Build a custom set around real incidents in your capstone's staging history:
- 3 staging incidents in the last 60 days -- what was the first-seen symptom and what was the actual cause?
- 3 near-miss deploys -- what caught them, and what alert would have caught them automatically?
- 3 cloud bills that surprised you -- which came from observability, backup, or logging, and is the trade-off still worth it?
These become postmortems, test cases, and PRR yellows -- whichever fits best.
Completion Checklist
- Completed at least one lane in full with artifacts committed
- Logged at least 10 real mistakes and corrections in the mistake journal
- Walked the 18-item PRR and signed or red-listed each item honestly
- At least one peer has validated the top runbook and the SLO document