Learning Resources

This module has no new required books. It integrates material from S6, S8, and S9 and directs the learner to a small set of authoritative external sources for each cluster. Use this page as a source map, not as an instruction to read all of it.

Canonical Book Backbone

Use these canonical book routes when you need a book-backed source of truth for reliability, distributed behavior, and secure operations:

Source Stack

| Source | Role | How to use it in this module |
|---|---|---|
| Google SRE Book (sre.google/sre-book) | Primary teaching source for SLOs, alerting, monitoring, and PRR | Default escalation for any SLO/alert/monitoring question |
| Google SRE Workbook (sre.google/workbook) | Practical extension of the SRE book | Use for error-budget policy, SLO implementation, and incident-response roles |
| OpenTelemetry docs (opentelemetry.io/docs) | Official standard for tracing, metrics, logs | The source of truth for spans, context propagation, sampling, semantic conventions |
| OWASP -- Threat Modeling (owasp.org) | Primary teaching source for STRIDE and threat-model framing | Use for the four-question framework and category definitions |
| SLSA (slsa.dev) | Framework for supply-chain integrity | Use for SLSA Build L2 target and provenance patterns |
| Microsoft Azure Architecture Center (learn.microsoft.com/azure/architecture) | Official cloud-pattern catalog | Use for circuit breaker, retry, bulkhead, and related resilience patterns |
| Building Secure and Reliable Systems (Google, free online) | Selective reinforcement | Open only for cross-topic framing (security + reliability) when the Cluster 3 concept pages are insufficient |

Resource Map by Cluster

Cluster 1: SLOs and Error Budgets

| Need | Best source | Why |
|---|---|---|
| SLI / SLO / SLA definitions and framing | Google SRE Book -- Service Level Objectives | Canonical explanation with "how many nines" framing |
| SLO implementation details and decision matrix | Google SRE Workbook -- Implementing SLOs | Error-budget policy tiers and stakeholder alignment |
| Multi-window multi-burn-rate alerting | Google SRE Workbook -- Alerting on SLOs | Origin of the multi-window, multi-burn-rate alert pattern |
| Golden-signals grounding | Google SRE Book -- Monitoring Distributed Systems | Latency, traffic, errors, saturation |
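
To make the burn-rate row above concrete before opening the source, here is a minimal Python sketch of the arithmetic behind a multi-window burn-rate check. Everything named here (`BurnRateRule`, `threshold_for`, `should_alert`, the 30-day period, and the "2% of budget in 1 hour" figure) is an illustrative assumption rather than something defined by this module; treat the numbers as a starting point to verify against the source chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """Hypothetical alert rule: both windows must exceed the same threshold."""
    long_window_hours: float
    short_window_hours: float
    threshold: float


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 leaves roughly 0.001."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / error_budget(slo_target)


def threshold_for(budget_fraction: float, period_hours: float,
                  window_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


def should_alert(long_ratio: float, short_ratio: float,
                 slo_target: float, rule: BurnRateRule) -> bool:
    """Fire only if the long AND the short window both burn too fast."""
    long_burn = burn_rate(long_ratio, slo_target)
    short_burn = burn_rate(short_ratio, slo_target)
    return long_burn >= rule.threshold and short_burn >= rule.threshold


# Assumed example: 30-day SLO period, page when 2% of the budget burns in 1 hour.
# The resulting threshold is about 14.4.
fast_burn = BurnRateRule(
    long_window_hours=1.0,
    short_window_hours=5 / 60,
    threshold=threshold_for(0.02, 30 * 24, 1.0),
)

# 2% errors in both windows against a 99.9% SLO is roughly a 20x burn: alert.
print(should_alert(0.02, 0.02, 0.999, fast_burn))
# Same long-window ratio, but the short window has recovered: no alert.
print(should_alert(0.02, 0.0005, 0.999, fast_burn))
```

The short window is the design point worth noticing: it stops the alert from continuing to fire after the problem has already cleared, even though the long window still shows a high error ratio.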

Cluster 2: Observability in Practice

| Need | Best source | Why |
|---|---|---|
| Signals model (logs / metrics / traces) | OpenTelemetry -- Concepts | Canonical definitions and relationships |
| Semantic conventions for field names | OpenTelemetry -- Concepts (sem-conv section) | Stable attribute naming across services |
| Dashboard layout intuition | Google SRE Book -- Monitoring Distributed Systems | Four golden signals structure informs the three-question layout |
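
As a small illustration of what "stable attribute naming across services" buys you, the sketch below builds one structured log line whose attribute keys follow OpenTelemetry semantic-convention names. The record envelope (`severity`, `trace_id`, `span_id`, `attributes`) and the `log_record` helper are hypothetical choices made for this example, not an OpenTelemetry API; confirm the attribute keys against the current semantic-conventions docs before adopting them.

```python
import json

# Attribute keys intended to match OpenTelemetry semantic conventions.
# Verify against the current docs; conventions evolve between releases.
SERVICE_NAME = "service.name"
HTTP_METHOD = "http.request.method"
HTTP_STATUS = "http.response.status_code"
URL_PATH = "url.path"


def log_record(service: str, method: str, path: str, status: int,
               trace_id: str, span_id: str) -> str:
    """Return one JSON log line (hypothetical envelope, shared key names)."""
    record = {
        "severity": "ERROR" if status >= 500 else "INFO",
        # Carrying the trace and span IDs lets a log line be joined to its trace.
        "trace_id": trace_id,
        "span_id": span_id,
        "attributes": {
            SERVICE_NAME: service,
            HTTP_METHOD: method,
            URL_PATH: path,
            HTTP_STATUS: status,
        },
    }
    return json.dumps(record, sort_keys=True)


print(log_record(
    service="checkout",
    method="GET",
    path="/cart",
    status=503,
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    span_id="00f067aa0ba902b7",
))
```

If every service emits the same keys, a single dashboard query or alert rule can filter on `http.response.status_code` without per-service field mapping.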

Cluster 3: Threat Model for the Capstone

| Need | Best source | Why |
|---|---|---|
| STRIDE + four-question framework | OWASP -- Threat Modeling | Most authoritative short intro |
| Supply-chain framework (SLSA levels) | SLSA | Level definitions and practitioner guidance |
| Secret rotation / detection tooling | tool docs: gitleaks, trufflehog | Official scanner docs -- treat as authoritative for usage |
| Least privilege patterns (cloud) | provider docs (AWS IAM, GCP IAM, Azure AD, now Microsoft Entra ID) | Always authoritative for exact policy semantics |

Cluster 4: Failure Planning

| Need | Best source | Why |
|---|---|---|
| Retry, circuit breaker, bulkhead patterns | Microsoft Azure Architecture Center -- Circuit Breaker | Canonical state-machine description and considerations |
| Incident declaration and coordination | Google SRE Workbook -- Incident Response | ICS-derived roles applicable at any scale |
| Backup and recovery sanity | provider docs (RDS automated backups, PITR) | Ground truth for your specific data store |
| RPO/RTO framing | Google SRE Workbook -- Implementing SLOs | Connects reliability targets to recovery expectations |
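
For orientation before reading the circuit-breaker source, the state machine it describes (closed, open, half-open) can be sketched in a few lines of Python. This is a minimal single-threaded sketch under stated assumptions: the class name, the `failure_threshold` and `reset_timeout_s` parameters, and the injectable `clock` are all hypothetical, and a production breaker would also need locking, metrics, and a policy for which exceptions count as failures.

```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class State(Enum):
    CLOSED = "closed"        # calls pass through; failures are counted
    OPEN = "open"            # calls are rejected immediately
    HALF_OPEN = "half_open"  # one trial call decides the next state


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = 0.0
        self.state = State.CLOSED

    def call(self, fn: Callable[[], T]) -> T:
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: let one trial call through.
            self.state = State.HALF_OPEN
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if (self.state is State.HALF_OPEN
                or self._failures >= self.failure_threshold):
            self.state = State.OPEN
            self._opened_at = self._clock()

    def _on_success(self) -> None:
        self._failures = 0
        self.state = State.CLOSED
```

A caller wraps each dependency call, for example `breaker.call(lambda: fetch_inventory())` with a hypothetical `fetch_inventory`, and treats `CircuitOpenError` as the signal to serve a fallback instead of waiting on a dependency that is known to be failing.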

Cluster 5: Runbooks and On-Call

| Need | Best source | Why |
|---|---|---|
| Runbook structure + on-call roles | Google SRE Workbook -- Incident Response | Best short source for incident-command patterns |
| PRR framing and engagement model | Google SRE Book -- Evolving SRE Engagement Model | Origin of the PRR pattern and the shift to continuous review |
| Burn-rate + policy at incident time | Google SRE Workbook -- Implementing SLOs | Policy ladder application during incidents |

Use Rules

  • Prefer the SRE book / workbook for SLO, alert, monitoring, and PRR questions.
  • Prefer OpenTelemetry docs for anything related to tracing, spans, context propagation, or sampling.
  • Prefer provider docs for IAM, backup, and cloud-specific behavior.
  • Open one link at a time, with a specific question in mind. Do not read chapters by default.
  • If a concept page already answered the question, do not read the source -- write the runbook instead.