Learning Resources
This module has no new required books. It integrates material from S6, S8, and S9 and directs the learner to a small set of authoritative external sources for each cluster. Use this page as a source map, not as an instruction to read all of it.
Canonical Book Backbone
Use these canonical books when you need a book-backed source of truth for reliability, distributed behavior, and secure operations:
- Building Secure and Reliable Systems for the joint security-and-reliability lens that this module expects.
- Designing Data-Intensive Applications (DDIA) for replication, failure, consistency, and operational tradeoffs behind SLOs and incident thinking.
- Designing Distributed Systems for coordination, resilience, and production-pattern reasoning.
- Computer Networking for protocol-level behavior behind latency, TLS, and transport-visible failures.
Source Stack
| Source | Role | How to use it in this module |
|---|---|---|
| Google SRE Book (sre.google/sre-book) | Primary teaching source for SLOs, alerting, monitoring, and production readiness review (PRR) | Default escalation for any SLO/alert/monitoring question |
| Google SRE Workbook (sre.google/workbook) | Practical extension of the SRE book | Use for error-budget policy, SLO implementation, and incident-response roles |
| OpenTelemetry docs (opentelemetry.io/docs) | Official standard for tracing, metrics, logs | The source of truth for spans, context propagation, sampling, semantic conventions |
| OWASP -- Threat Modeling (owasp.org) | Primary teaching source for STRIDE and threat-model framing | Use for the four-question framework and category definitions |
| SLSA (slsa.dev) | Framework for supply-chain integrity | Use for SLSA Build L2 target and provenance patterns |
| Microsoft Azure Architecture Center (learn.microsoft.com/azure/architecture) | Official cloud-pattern catalog | Use for circuit breaker, retry, bulkhead, and related resilience patterns |
| Building Secure and Reliable Systems (Google, free online) | Selective reinforcement | Open only for cross-topic framing (security + reliability) when the Cluster 3 concept pages are insufficient |
Resource Map by Cluster
Cluster 1: SLOs and Error Budgets
| Need | Best source | Why |
|---|---|---|
| SLI / SLO / SLA definitions and framing | Google SRE Book -- Service Level Objectives | Canonical explanation with "how many nines" framing |
| SLO implementation details and decision matrix | Google SRE Workbook -- Implementing SLOs | Error-budget policy tiers and stakeholder alignment |
| Multi-window multi-burn-rate alerting | Google SRE Workbook -- Alerting on SLOs | Source of the burn-rate alert pattern and its multi-window refinement |
| Golden-signals grounding | Google SRE Book -- Monitoring Distributed Systems | Latency, traffic, errors, saturation |
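
The arithmetic behind "how many nines" and burn-rate alerting can be sketched in a few lines. This is a minimal illustration, not the SRE Workbook's implementation: the function names are hypothetical, and the default threshold of 14.4 is an assumption corresponding to spending 2% of a 30-day budget in one hour (0.02 × 720 h / 1 h). Tune both to your own policy.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full outage over the SLO window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / error_budget_fraction(slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed value; see the lead-in
) -> bool:
    """Multi-window check: page only if BOTH windows exceed the burn threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(0.999), 1))  # budget in minutes for 99.9% / 30 days
    print(should_page(0.02, 0.02, slo_target=0.999))  # both windows burning fast
```

For a 99.9% target over 30 days the budget works out to roughly 43 minutes, which is why a sustained 2% error ratio (a burn rate of about 20) justifies a page under these assumed settings.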
Cluster 2: Observability in Practice
| Need | Best source | Why |
|---|---|---|
| Signals model (logs / metrics / traces) | OpenTelemetry -- Concepts | Canonical definitions and relationships |
| Semantic conventions for field names | OpenTelemetry -- Concepts (sem-conv section) | Stable attribute naming across services |
| Dashboard layout intuition | Google SRE Book -- Monitoring Distributed Systems | Four golden signals structure informs the three-question layout |
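
To make "stable attribute naming" and context propagation concrete, here is a hedged sketch that parses a W3C `traceparent` header and attaches the IDs to a structured log record. It uses only the standard library, not the OpenTelemetry SDK. The helper names are hypothetical, and the attribute keys (`service.name`, `http.request.method`, `http.response.status_code`) are my reading of the current semantic conventions, so check them against the OpenTelemetry docs before relying on them.

```python
import json
import re
from typing import Optional

# traceparent layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Return trace_id, span_id, and the sampled flag, or None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid and must not be propagated.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def make_log_record(service: str, method: str, status: int, traceparent: str) -> str:
    """Build one JSON log line that can be joined to a trace by trace_id."""
    record = {
        "service.name": service,              # assumed semantic-convention key
        "http.request.method": method,        # assumed semantic-convention key
        "http.response.status_code": status,  # assumed semantic-convention key
    }
    context = parse_traceparent(traceparent)
    if context is not None:
        record["trace_id"] = context["trace_id"]
        record["span_id"] = context["span_id"]
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(make_log_record("checkout", "GET", 503, header))
```

Using the same keys in every service is what lets a dashboard or log query filter on one field name instead of one per team.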
Cluster 3: Threat Model for the Capstone
| Need | Best source | Why |
|---|---|---|
| STRIDE + four-question framework | OWASP -- Threat Modeling | Most authoritative short intro |
| Supply-chain framework (SLSA levels) | SLSA | Level definitions and practitioner guidance |
| Secret rotation / detection tooling | tool docs: gitleaks, trufflehog | Official scanner docs -- treat as authoritative for usage |
| Least privilege patterns (cloud) | provider docs (AWS IAM, GCP IAM, Microsoft Entra ID, formerly Azure AD) | Always authoritative for exact policy semantics |
Cluster 4: Failure Planning
| Need | Best source | Why |
|---|---|---|
| Retry, circuit breaker, bulkhead patterns | Microsoft Azure Architecture Center -- Circuit Breaker, Retry, and Bulkhead pattern pages | Canonical state-machine description and considerations |
| Incident declaration and coordination | Google SRE Workbook -- Incident Response | Roles derived from the Incident Command System (ICS), applicable at any scale |
| Backup and recovery sanity | provider docs (RDS automated backups, PITR) | Ground truth for your specific data store |
| RPO/RTO framing | Google SRE Workbook -- Implementing SLOs | Connects reliability targets to recovery expectations |
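
The circuit breaker's state machine (closed, open, half-open) is easier to reason about with a small sketch in hand. The version below is an illustrative, single-threaded approximation, not the Azure reference implementation: the class name, parameters, and the choice to count consecutive failures are all assumptions made for brevity.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical minimal breaker: closed -> open -> half_open -> closed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()

    def _record_success(self) -> None:
        self._failures = 0
        self.state = "closed"
```

A caller would wrap each dependency call, for example `breaker.call(lambda: fetch_inventory())` where `fetch_inventory` is a placeholder, and treat `CircuitOpenError` as the signal to serve a fallback. A production breaker would also need locking and metrics, which this sketch omits.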
Cluster 5: Runbooks and On-Call
| Need | Best source | Why |
|---|---|---|
| Runbook structure + on-call roles | Google SRE Workbook -- Incident Response | Best short source for incident-command patterns |
| PRR framing and engagement model | Google SRE Book -- Evolving SRE Engagement Model | Origin of the PRR pattern and the shift to continuous review |
| Burn-rate + policy at incident time | Google SRE Workbook -- Implementing SLOs | Policy ladder application during incidents |
Use Rules
- Prefer the SRE book / workbook for SLO, alert, monitoring, and PRR questions.
- Prefer OpenTelemetry docs for anything tracing, span, propagation, or sampling related.
- Prefer provider docs for IAM, backup, and cloud-specific behavior.
- Open one link at a time, with a specific question in mind. Do not read chapters by default.
- If a concept page already answered the question, do not read the source -- write the runbook instead.