Reference and Selective Reading
You do not need to read the source books front to back for this module. Start with the concept pages and practice pages. Open the local chunks and external links below only when you need an alternative explanation, more worked examples, or deeper context.
Source Roles
| Source | Role | Why it is here |
|---|---|---|
| Site Reliability Engineering and The SRE Workbook (Google) | Primary reliability operating model | The most influential treatment of SLIs, SLOs, error budgets, and incident management. Both books are free to read at sre.google. |
| System Design Primer (donnemartin, local chunks) | Scalability vocabulary and catalog | Cache levels and update patterns, load balancing, availability patterns, latency numbers |
| Fundamentals of Software Architecture (Richards, Ford) | Characteristics-and-trade-offs framework | Naming reliability, scalability, performance, observability as first-class characteristics; fitness functions |
| Systems Performance (Gregg) | Performance methodology | The USE method's original exposition |
| Principles of Chaos Engineering | Chaos engineering canon | The five principles as the working definition |
| How Complex Systems Fail (Cook) | Cognitive framing for postmortems | The essay that anchors blameless culture |
Read Only If Stuck
Cluster 1: Performance Reasoning
- System Design Primer: Performance vs Scalability
- System Design Primer: Latency vs Throughput
- System Design Primer: Powers of Two and Latency Numbers
- FoSA: Architecture Characteristics Defined
- FoSA: Cross-Cutting Architecture Characteristics
- FoSA: Measuring Architecture Characteristics
External (validated):
- Brendan Gregg: The USE Method
- Brendan Gregg: Flame Graphs
- Dean & Barroso: The Tail at Scale
- Neil Gunther: Universal Scalability Law
- Google SRE Book: Monitoring Distributed Systems (Four Golden Signals)
Cluster 2: Scaling Strategies
- System Design Primer: Performance vs Scalability
- System Design Primer: Load Balancer
- System Design Primer: Reverse Proxy
- System Design Primer: Application Layer and Microservices
- System Design Primer: Content Delivery Network
- System Design Primer: Cache Overview and Levels
- System Design Primer: Cache Update Patterns
- System Design Primer: Availability Patterns
Cluster 3: Reliability Engineering
- System Design Primer: Availability Patterns
- System Design Primer: CAP Theorem
- System Design Primer: Consistency Patterns
- FoSA: Explicit Characteristics
- FoSA: Implicit Characteristics
- FoSA: Fitness Functions
External (validated):
- Google SRE Book: Embracing Risk
- Google SRE Book: Service Level Objectives
- SRE Workbook: Implementing SLOs
- SRE Workbook: Alerting on SLOs
- Principles of Chaos Engineering
- Richard Cook: How Complex Systems Fail
- Google SRE Book: Addressing Cascading Failures
Cluster 4: Capacity Planning and Load
- System Design Primer: Latency vs Throughput
- System Design Primer: Asynchronism
- System Design Primer: Load Balancer
- System Design Primer: Availability Patterns
- System Design Primer: Powers of Two and Latency Numbers
- FoSA: Measuring Architecture Characteristics
External (validated):
- Google SRE Book: Handling Overload
- Google SRE Book: Addressing Cascading Failures
- Neil Gunther: Universal Scalability Law
Cluster 5: Incident and Observability
- FoSA: Operations and DevOps
- FoSA: Cross-Cutting Architecture Characteristics
- FoSA: Implicit Characteristics
- FoSA: Measuring Architecture Characteristics
- System Design Primer: Availability Patterns
External (validated):
- Google SRE Book: Monitoring Distributed Systems
- Google SRE Book: Managing Incidents
- Google SRE Book: Emergency Response
- Google SRE Book: Postmortem Culture
- Etsy: Blameless PostMortems and a Just Culture
- PagerDuty: Incident Response
- OpenTelemetry
Optional Deep Dive
- FoSA: Fitness Functions - error-budget alerts are an instance of automated fitness functions over a reliability characteristic.
- System Design Primer: Consistency Patterns - where availability and consistency trade off at scale.
- System Design Primer: CAP Theorem - the hard constraint behind multi-region reliability.
- Google SRE Book: The Production Environment at Google - situating SRE concepts in real production infrastructure.
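
The link between error budgets and fitness functions in the first item above can be sketched in a few lines. This is a minimal illustration, not code from FoSA or the SRE books: the `SloWindow` type, the function names, and the `min_remaining` floor are all hypothetical, and the budget is computed from simple request counts over one window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def reliability_fitness(
    window: SloWindow, slo_target: float, min_remaining: float = 0.0
) -> bool:
    """Fitness function: pass while the remaining budget is at or above the floor."""
    return error_budget_remaining(window, slo_target) >= min_remaining
```

With a 99.9% target over 1,000,000 requests, 1,000 failures are allowed, so 400 failures leave about 60% of the budget and the check passes. Run on a schedule or in a deployment pipeline, a check of this shape plays the role of an automated fitness function over the reliability characteristic.
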
Concept-to-Source Map
| Primary concept | Best source if stuck | Why this source |
|---|---|---|
| Latency, throughput, utilization | System Design Primer: Latency vs Throughput | The most compact side-by-side comparison |
| USE / RED / Golden Signals | Brendan Gregg: The USE Method, SRE: Monitoring Distributed Systems | The original sources for USE and the Four Golden Signals; RED (rate, errors, duration) overlaps with the golden signals for request-driven services |
| Percentile latency | Dean & Barroso: The Tail at Scale | The canonical paper on why tails dominate fan-out |
| Amdahl / USL | Neil Gunther: Universal Scalability Law | The authoritative modern USL source |
| Horizontal vs vertical scaling | System Design Primer: Performance vs Scalability | The direct framing of the two axes |
| Statelessness and sessions | System Design Primer: Load Balancer | Where sticky vs stateless is enforced |
| Caching layers | System Design Primer: Cache Overview and Levels | Every layer enumerated |
| Cache update patterns | System Design Primer: Cache Update Patterns | Cache-aside, write-through, write-behind, refresh-ahead |
| SLI / SLO / error budget | SRE: Service Level Objectives, SRE Workbook: Implementing SLOs | Canonical treatment with worked examples |
| Failure modes | SRE: Addressing Cascading Failures | The chapter on cascading failure and its mitigations |
| Chaos engineering | Principles of Chaos Engineering | The five principles, short and authoritative |
| Back-pressure / Little's Law | SRE: Handling Overload | The operational view |
| Rate limiting / load shedding | SRE: Handling Overload | Admission control and shedding in practice |
| Capacity planning | SRE: Addressing Cascading Failures, System Design Primer: Powers of Two and Latency Numbers | Numbers and mental models |
| Observability pillars | OpenTelemetry concepts, FoSA: Measuring Architecture Characteristics | Modern framing plus characteristics view |
| Incident lifecycle | SRE: Managing Incidents, PagerDuty: Incident Response | The canonical chapter and a practitioner's playbook |
| Blameless postmortems | SRE: Postmortem Culture, Cook: How Complex Systems Fail | Practice and theory of blamelessness |
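
Two rows in the map rest on arithmetic that is quick to check before opening the source: tail latency under fan-out and Little's Law. The sketch below is illustrative only. The function names are hypothetical, the fan-out calculation assumes the backend calls are independent, and Little's Law is applied to long-run averages in a stable system.

```python
def p_any_slow(fan_out: int, p_slow_per_call: float) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    if not 0.0 <= p_slow_per_call <= 1.0:
        raise ValueError("p_slow_per_call must be a probability")
    # All calls are fast with probability (1 - p)^n; take the complement.
    return 1.0 - (1.0 - p_slow_per_call) ** fan_out


def mean_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: average requests in the system = arrival rate * mean time in system."""
    if arrival_rate_per_s < 0 or mean_latency_s < 0:
        raise ValueError("rate and latency must be non-negative")
    return arrival_rate_per_s * mean_latency_s
```

If each backend call exceeds its 99th-percentile latency 1% of the time, `p_any_slow(100, 0.01)` comes to roughly 63%, so a request that fans out to 100 backends is more likely than not to see at least one slow response. Likewise, `mean_in_flight(200, 0.05)` gives about 10 concurrent requests for 200 requests per second at 50 ms mean latency.
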