Reference and Selective Reading

You do not need to read the source books front-to-back for this module. Use the concept pages and practice pages first. Open these local chunks only when you need alternate exposition, more worked examples, or deeper context.

Source Roles

Source	Role	Why it is here
Site Reliability Engineering and The SRE Workbook (Google)	Primary reliability operating model	The most influential treatment of SLIs, SLOs, error budgets, and incident management. Both books are free to read at sre.google
System Design Primer (donnemartin, local chunks)	Scalability vocabulary and catalog	Cache levels and update patterns, load balancing, availability patterns, latency numbers
Fundamentals of Software Architecture (Richards, Ford)	Characteristics-and-trade-offs framework	Names reliability, scalability, performance, and observability as first-class architecture characteristics; covers fitness functions
Systems Performance (Gregg)	Performance methodology	The USE method's original exposition
Principles of Chaos	Chaos engineering canon	The five principles as the working definition
How Complex Systems Fail (Cook)	Cognitive framing for postmortems	The essay that anchors blameless culture

Read Only If Stuck

Cluster 1: Performance Reasoning

External (validated):

Cluster 2: Scaling Strategies

Cluster 3: Reliability Engineering

External (validated):

Cluster 4: Capacity Planning and Load

External (validated):

Cluster 5: Incident and Observability

External (validated):

Optional Deep Dive

FoSA: Fitness Functions - error-budget alerts are an instance of automated fitness functions over a reliability characteristic.
System Design Primer: Consistency Patterns - where availability and consistency trade off at scale.
System Design Primer: CAP Theorem - the hard constraint behind multi-region reliability.
Google SRE Book: The Production Environment at Google - situating SRE concepts in real production infrastructure.
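
The first deep-dive item describes error-budget alerts as automated fitness functions. A minimal sketch of that idea follows, assuming a request-based availability SLI; the function names, the min_remaining threshold, and the sample numbers are hypothetical and do not come from FoSA or the SRE books.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing spent, 0.0 means exhausted, negative means overspent.
    Assumes a request-based availability SLI (good requests / total requests).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so no budget was spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def reliability_fitness_function(
    slo_target: float,
    total_requests: int,
    failed_requests: int,
    min_remaining: float = 0.0,  # hypothetical alert threshold
) -> bool:
    """Pass/fail check, in the spirit of an automated fitness function."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining >= min_remaining


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    print(reliability_fitness_function(0.999, 1_000_000, 1_500))
```

With a 99.9% target and 250 failures in 1,000,000 requests, roughly three quarters of the budget remains and the check passes; at 1,500 failures the budget is overspent and the check fails. A production alert would more likely be built on burn rate over several windows, as the SRE Workbook discusses.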

Concept-to-Source Map

Primary concept	Best source if stuck	Why this source
Latency, throughput, utilization	System Design Primer: Latency vs Throughput	The cleanest compact side-by-side comparison
USE / RED / Golden Signals	Brendan Gregg: The USE Method, SRE: Monitoring Distributed Systems	The three frameworks side-by-side
Percentile latency	Dean & Barroso: The Tail at Scale	The canonical paper on why tails dominate fan-out
Amdahl / USL	Neil Gunther: USL	The authoritative modern USL source
Horizontal vs vertical scaling	System Design Primer: Performance vs Scalability	The direct framing of the two axes
Statelessness and sessions	System Design Primer: Load Balancer	Where sticky vs stateless is enforced
Caching layers	System Design Primer: Cache Overview and Levels	Every layer enumerated
Cache update patterns	System Design Primer: Cache Update Patterns	Cache-aside, write-through, write-behind, refresh-ahead
SLI / SLO / error budget	SRE: Service Level Objectives, SRE Workbook: Implementing SLOs	Canonical treatment with worked examples
Failure modes	SRE: Addressing Cascading Failures	The chapter on cascading failure and its mitigations
Chaos engineering	Principles of Chaos	The five principles, short and authoritative
Back-pressure / Little's Law	SRE: Handling Overload	The operational view
Rate limiting / load shedding	SRE: Handling Overload	Admission control and shedding in practice
Capacity planning	SRE: Addressing Cascading Failures, System Design Primer: Powers of Two and Latency Numbers	Numbers and mental models
Observability pillars	OpenTelemetry concepts, FoSA: Measuring Architecture Characteristics	Modern framing plus characteristics view
Incident lifecycle	SRE: Managing Incidents, PagerDuty Incident Response	The canonical treatment and a practitioner's playbook
Blameless postmortems	SRE: Postmortem Culture, Cook: How Complex Systems Fail	Practice and theory of blamelessness
