Reference and Selective Reading

You do not need to read the source books front-to-back for this module. Use the concept pages and practice pages first. Open these local chunks only when you need alternate exposition, more worked examples, or deeper context.

Source Roles

| Source | Role | Why it is here |
| --- | --- | --- |
| Site Reliability Engineering and The SRE Workbook (Google) | Primary reliability operating model | The most influential treatment of SLIs, SLOs, error budgets, and incident management. Available free at sre.google |
| System Design Primer (donnemartin, local chunks) | Scalability vocabulary and catalog | Cache levels and update patterns, load balancing, availability patterns, latency numbers |
| Fundamentals of Software Architecture (Richards, Ford) | Characteristics-and-trade-offs framework | Naming reliability, scalability, performance, observability as first-class characteristics; fitness functions |
| Systems Performance (Gregg) | Performance methodology | The USE method's original exposition |
| Principles of Chaos | Chaos engineering canon | The five principles as the working definition |
| How Complex Systems Fail (Cook) | Cognitive framing for postmortems | The essay that anchors blameless culture |

Read Only If Stuck

Cluster 1: Performance Reasoning

External (validated):

Cluster 2: Scaling Strategies

Cluster 3: Reliability Engineering

External (validated):

Cluster 4: Capacity Planning and Load

External (validated):

Cluster 5: Incident and Observability

External (validated):

Optional Deep Dive

Concept-to-Source Map

| Primary concept | Best source if stuck | Why this source |
| --- | --- | --- |
| Latency, throughput, utilization | System Design Primer: Latency vs Throughput | The cleanest compact side-by-side |
| USE / RED / Golden Signals | Brendan Gregg: The USE Method, SRE: Monitoring Distributed Systems | The three frameworks side-by-side |
| Percentile latency | Dean & Barroso: The Tail at Scale | The canonical paper on why tails dominate fan-out |
| Amdahl / USL | Neil Gunther: USL | The authoritative modern USL source |
| Horizontal vs vertical scaling | System Design Primer: Performance vs Scalability | The direct framing of the two axes |
| Statelessness and sessions | System Design Primer: Load Balancer | Where sticky vs stateless is enforced |
| Caching layers | System Design Primer: Cache Overview and Levels | Every layer enumerated |
| Cache update patterns | System Design Primer: Cache Update Patterns | Cache-aside, write-through, write-behind, refresh-ahead |
| SLI / SLO / error budget | SRE: Service Level Objectives, SRE Workbook: Implementing SLOs | Canonical treatment with worked examples |
| Failure modes | SRE: Addressing Cascading Failures | The chapter on cascading failure and its mitigations |
| Chaos engineering | Principles of Chaos | The five principles, short and authoritative |
| Back-pressure / Little's Law | SRE: Handling Overload | The operational view |
| Rate limiting / load shedding | SRE: Handling Overload | Admission control and shedding in practice |
| Capacity planning | SRE: Addressing Cascading Failures, System Design Primer: Powers of Two and Latency Numbers | Numbers and mental models |
| Observability pillars | OpenTelemetry concepts, FoSA: Measuring Architecture Characteristics | Modern framing plus characteristics view |
| Incident lifecycle | SRE: Managing Incidents, PagerDuty Incident Response | The canonical and the practitioner's playbook |
| Blameless postmortems | SRE: Postmortem Culture, Cook: How Complex Systems Fail | Practice and theory of blamelessness |