Reference and Selective Reading
This module has no new required books. The concept pages are the main path. Use this page only to find the authoritative source when a concept page leaves a gap, and to see how the module maps to earlier semesters.
Source Roles
| Source | Role | Why it is here |
|---|---|---|
| Google SRE Book (sre.google/sre-book) | Primary teaching source for SLOs, alerting, monitoring, and production readiness review (PRR) | Canonical framing for the operational clusters |
| Google SRE Workbook (sre.google/workbook) | Selective extension of the SRE book | Implementation details for SLOs, error budgets, and incident response |
| OpenTelemetry docs (opentelemetry.io/docs) | Official standard for tracing, metrics, logs | Ground truth for instrumentation and propagation |
| OWASP -- Threat Modeling | Primary teaching source for STRIDE | Authoritative framework definitions |
| SLSA | Supply-chain framework | Anchor for the "SLSA Build L2" goal |
| Microsoft Azure Architecture Center | Cloud patterns (circuit breaker, retry, bulkhead) | Canonical state-machine descriptions |
| Prior semesters (S6, S8, S9) | Internal prerequisites | What this module integrates and applies |
Read Only If Stuck
SLOs, Error Budgets, Alerting
- Google SRE Book -- Service Level Objectives
- Google SRE Workbook -- Implementing SLOs
- Google SRE Book -- Practical Alerting
- Google SRE Book -- Monitoring Distributed Systems
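If the arithmetic behind an error budget or a burn rate is the sticking point, the short sketch below may help before you open the sources. It assumes an availability-style SLO over a rolling window; the function names and the 30-day default are illustrative choices, not taken from the SRE Book or Workbook.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day window is an assumption; use whatever window your SLO names.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent relative to the sustainable pace.

    A value of 1.0 means the budget would run out exactly at the end of the
    window; higher values mean it runs out proportionally sooner.
    """
    return observed_error_ratio / (1.0 - slo_target)


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    rate = burn_rate(observed_error_ratio=0.01, slo_target=0.999)
    print(f"budget: {budget:.1f} min per 30 days, burn rate: {rate:.1f}x")
```

For a 99.9% target the budget works out to roughly 43 minutes per 30 days, and a sustained 1% error ratio burns it about ten times faster than the sustainable pace. Alert thresholds built on this ratio are a design decision for your capstone, not something this sketch settles.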
Observability in Practice
- OpenTelemetry -- Concepts (signals, sampling, propagation, semantic conventions)
- Google SRE Book -- Monitoring Distributed Systems
Threat Model, Secrets, Supply Chain, Least Privilege
- OWASP -- Threat Modeling
- SLSA
- provider IAM docs (AWS IAM User Guide / GCP IAM / Microsoft Entra ID, formerly Azure AD) -- authoritative for exact policy semantics
- tool docs: gitleaks, trufflehog
Failure Planning, Backup, Runbooks, PRR
- Microsoft Azure Architecture Center -- Circuit Breaker Pattern
- Google SRE Workbook -- Incident Response
- Google SRE Book -- Evolving SRE Engagement Model (PRR)
- provider backup / PITR docs for your data store
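As a quick orientation before reading the Circuit Breaker Pattern page, here is a minimal sketch of the closed / open / half-open state machine. It is a simplification under stated assumptions: a single success in the half-open state closes the circuit, and the class name, thresholds, and injectable `clock` parameter are hypothetical choices for illustration rather than anything prescribed by the Azure documentation.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            # After the timeout, let one trial call through (half-open).
            if self._clock() - self._opened_at >= self.reset_timeout_s:
                self.state = "half_open"
            else:
                raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _record_success(self) -> None:
        # Simplifying assumption: one success is enough to close the circuit.
        self._failures = 0
        self._opened_at = None
        self.state = "closed"
```

A caller would wrap each dependency call as `breaker.call(fetch_profile, user_id)` (where `fetch_profile` is a placeholder for your own function) and treat `CircuitOpenError` as the cue to serve the degraded-mode response. This sketch is not thread-safe and counts every exception as a failure; a real capstone would likely narrow both.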
Optional Deep Dive
- Building Secure and Reliable Systems (Google, free online) -- long-form cross-topic framing of security + reliability
- Google SRE Book (remaining chapters) -- for any chapter on testing, capacity, or emergency response you did not cover here
- OpenTelemetry semantic conventions -- for cross-service attribute naming at scale
Cross-Semester References
| Module 4 cluster | Prior semester module(s) it integrates |
|---|---|
| Cluster 1 (SLOs, error budgets, alerts) | S8 M04 Scale, Reliability, and Performance -- SLOs, symptom-based alerting |
| Cluster 2 (observability) | S8 M05 Observability and Debugging Under Production Pressure; S9 M05 Cloud Security & Observability |
| Cluster 3 (threat model, secrets, supply chain, least privilege) | S9 M01 Cloud Platform Fundamentals (IAM); S9 M05 Cloud Security & Observability |
| Cluster 4 (failure planning, retry/breaker/degraded, backup) | S6 M05 Distributed Systems Fundamentals (partial failure, timeouts); S8 M04 Scale, Reliability, and Performance |
| Cluster 5 (runbooks, on-call, PRR) | S8 M04 Scale, Reliability, and Performance; S10 M01/M02/M03 (what PRR now certifies) |
Concept-to-Source Map
| Primary concept | Best source if stuck | Why this source |
|---|---|---|
| Writing one real SLI and SLO for your capstone | Google SRE Book -- SLOs | Canonical definitions and "how many nines" |
| Error budget for a capstone: small but real | Google SRE Workbook -- Implementing SLOs | Policy ladder and decision matrix |
| Alert on the SLO, not everything | Google SRE Book -- Practical Alerting; Google SRE Workbook -- Alerting on SLOs | Alert hygiene (SRE Book) and the burn-rate pattern (Workbook) |
| Structured logs where they matter | OpenTelemetry -- Concepts | Signals model and stable attribute naming |
| Dashboard that answers 3 specific questions | Google SRE Book -- Monitoring Distributed Systems | Four golden signals that drive the three-question layout |
| Tracing the critical path end-to-end | OpenTelemetry -- Concepts | Spans, propagation, sampling, semantic conventions |
| STRIDE applied to your system | OWASP -- Threat Modeling | Most authoritative and concise framing |
| Secrets, dependencies, and supply chain | SLSA | Supply-chain levels and provenance concepts |
| Least privilege in practice | provider IAM docs + SRE Book -- Engagement Model | Provider is authoritative for policy semantics; SRE book frames review |
| Three most likely failures | Google SRE Workbook -- Implementing SLOs | Error-budget/failure-likelihood framing |
| Retry, circuit breaker, degraded mode | Azure -- Circuit Breaker | Canonical state-machine pattern |
| Backup and recovery: the forgotten basics | provider backup docs + SRE Workbook -- Incident Response | Provider for backup details; SRE for drill discipline |
| Writing a runbook for the top 3 incidents | SRE Workbook -- Incident Response | ICS-derived roles and declarations |
| On-call hygiene for a solo operator | SRE Workbook -- Incident Response | Role-collapsing strategies and sustainable paging |
| Production readiness review | SRE Book -- Engagement Model | Origin of the PRR pattern |