Learning Resources

This module has no new required books. It integrates material from S6, S8, and S9 and directs the learner to a small set of authoritative external sources for each cluster. Use this page as a source map, not as a reading list to work through end to end.

Canonical Book Backbone

Use these canonical book routes when you need a book-backed source of truth for reliability, distributed behavior, and secure operations:

Building Secure and Reliable Systems for the joint security-and-reliability lens that this module expects.
Designing Data-Intensive Applications (DDIA) for replication, failure, consistency, and operational tradeoffs behind SLOs and incident thinking.
Designing Distributed Systems for coordination, resilience, and production-pattern reasoning.
Computer Networking for protocol-level behavior behind latency, TLS, and transport-visible failures.

Source Stack

Source	Role	How to use it in this module
Google SRE Book (sre.google/sre-book)	Primary teaching source for SLOs, alerting, monitoring, and PRR	Default escalation for any SLO/alert/monitoring question
Google SRE Workbook (sre.google/workbook)	Practical extension of the SRE book	Use for error-budget policy, SLO implementation, and incident-response roles
OpenTelemetry docs (opentelemetry.io/docs)	Official standard for tracing, metrics, logs	The source of truth for spans, context propagation, sampling, semantic conventions
OWASP -- Threat Modeling (owasp.org)	Primary teaching source for STRIDE and threat-model framing	Use for the four-question framework and category definitions
SLSA (slsa.dev)	Framework for supply-chain integrity	Use for SLSA Build L2 target and provenance patterns
Microsoft Azure Architecture Center (learn.microsoft.com/azure/architecture)	Official cloud-pattern catalog	Use for circuit breaker, retry, bulkhead, and related resilience patterns
Building Secure and Reliable Systems (Google, free online)	Selective reinforcement	Open only for cross-topic framing (security + reliability) when the Cluster 3 concept pages are insufficient

Resource Map by Cluster

Cluster 1: SLOs and Error Budgets

Need	Best source	Why
SLI / SLO / SLA definitions and framing	Google SRE Book -- Service Level Objectives	Canonical explanation with "how many nines" framing
SLO implementation details and decision matrix	Google SRE Workbook -- Implementing SLOs	Error-budget policy tiers and stakeholder alignment
Multi-window multi-burn-rate alerting	Google SRE Workbook -- Alerting on SLOs	Origin of the multi-window, multi-burn-rate alert pattern
Golden-signals grounding	Google SRE Book -- Monitoring Distributed Systems	Latency, traffic, errors, saturation
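
As a rough orientation before opening the alerting sources, the burn-rate arithmetic can be sketched as follows. This is a minimal illustration, not the module's prescribed implementation: the function names, the 99.9% SLO in the usage note, and the default threshold of 14.4 (the commonly cited SRE Workbook example of spending 2% of a 30-day budget in one hour, since 0.02 x 720 hours = 14.4) are all assumptions chosen for the example.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.999 -> about 0.001)."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    return error_ratio / error_budget(slo)


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window check (threshold is an assumed example value).

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)
```

For example, with an assumed 99.9% SLO, a 2% error ratio in both windows gives a burn rate of roughly 20 and would page, while a 2% ratio in the long window with 0.05% in the short window would not, because the short window indicates the problem has already subsided.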

Cluster 2: Observability in Practice

Need	Best source	Why
Signals model (logs / metrics / traces)	OpenTelemetry -- Concepts	Canonical definitions and relationships
Semantic conventions for field names	OpenTelemetry -- Concepts (sem-conv section)	Stable attribute naming across services
Dashboard layout intuition	Google SRE Book -- Monitoring Distributed Systems	Four golden signals structure informs the three-question layout

Cluster 3: Threat Model for the Capstone

Need	Best source	Why
STRIDE + four-question framework	OWASP -- Threat Modeling	Most authoritative short intro
Supply-chain framework (SLSA levels)	SLSA	Level definitions and practitioner guidance
Secret rotation / detection tooling	tool docs: `gitleaks`, `trufflehog`	Official scanner docs -- treat as authoritative for usage
Least privilege patterns (cloud)	provider docs (AWS IAM, GCP IAM, Azure AD)	Always authoritative for exact policy semantics

Cluster 4: Failure Planning

Need	Best source	Why
Retry, circuit breaker, bulkhead patterns	Microsoft Azure Architecture Center -- Circuit Breaker	Canonical state-machine description and considerations
Incident declaration and coordination	Google SRE Workbook -- Incident Response	ICS-derived roles applicable at any scale
Backup and recovery sanity	provider docs (RDS automated backups, PITR)	Ground truth for your specific data store
RPO/RTO framing	Google SRE Workbook -- Implementing SLOs	Connects reliability targets to recovery expectations
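
To make the circuit-breaker state machine named in the table above concrete, here is a minimal single-threaded sketch. It is an illustration under stated assumptions, not the Azure Architecture Center's reference implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all hypothetical choices made for this example.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed example default
        self.reset_timeout = reset_timeout          # seconds; assumed default
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None      # None means closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        state = self.state
        if state == "open":
            # Fail fast instead of adding load to a struggling dependency.
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call in half-open, or too many failures while
            # closed, (re)opens the circuit and restarts the timeout.
            if state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

This sketch does not limit concurrent trial calls in the half-open state and is not thread-safe; a production breaker would typically address both, which is the kind of consideration the source page covers in more depth.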

Cluster 5: Runbooks and On-Call

Need	Best source	Why
Runbook structure + on-call roles	Google SRE Workbook -- Incident Response	Best short source for incident-command patterns
PRR framing and engagement model	Google SRE Book -- Evolving SRE Engagement Model	Origin of the PRR pattern and the shift to continuous review
Burn-rate + policy at incident time	Google SRE Workbook -- Implementing SLOs	Policy ladder application during incidents

Use Rules

Prefer the SRE book / workbook for SLO, alert, monitoring, and PRR questions.
Prefer OpenTelemetry docs for anything related to tracing, spans, context propagation, or sampling.
Prefer provider docs for IAM, backup, and cloud-specific behavior.
Open one link at a time, with a specific question in mind. Do not read chapters by default.
If a concept page already answered the question, do not read the source -- write the runbook instead.
