Learning Resources

This module is populated from the local chunked books in library/raw/semester-08-system-design-leadership/books and a short set of validated external resources. Use this page as a source map, not as an instruction to read everything.

Source Stack

Book	Role	How to use it in this module
Site Reliability Engineering (Beyer, Jones, Petoff, Murphy - Google)	Primary operating model	Every reliability concept (SLI/SLO, error budget, incident management, postmortem, emergency response). Available free online at sre.google/sre-book
The Site Reliability Workbook (Beyer et al., Google)	Practical companion to the SRE Book	SLO design examples, alerting math, implementing error budgets. Available free at sre.google/workbook
System Design Primer (donnemartin, local chunks)	Scalability vocabulary and taxonomy	Cache levels and update patterns, availability patterns, latency numbers
Fundamentals of Software Architecture (Richards, Ford)	Architecture characteristics framework	Naming and rating reliability, scalability, performance, observability as first-class characteristics; fitness functions as automated checks
Building Secure and Reliable Systems	Local support	Use for the overlap between reliability, incident response, security review, and production guardrails
Systems Performance (Brendan Gregg)	Selective support for USE	The USE method's original exposition and flame graphs
Principles of Chaos Engineering	Selective support for Cluster 3	The five principles and the hypothesis-driven methodology
How Complex Systems Fail (Richard Cook)	Selective support for Cluster 5	The 18 principles that frame blameless postmortems

Resource Map by Cluster

Cluster 1: Performance Reasoning

Need	Best local chunk	Why
Performance vs scalability distinction	System Design Primer: Performance vs Scalability	The one-page framing of the two quantities
Latency vs throughput	System Design Primer: Latency vs Throughput	Compact side-by-side treatment
Latency numbers every engineer knows	System Design Primer: Powers of Two and Latency Numbers	Jeff Dean's numbers; essential calibration
Scalability characteristic framing	FoSA: Cross-Cutting Architecture Characteristics	Scalability as a first-class trade-off
Measurement	FoSA: Measuring Architecture Characteristics	How to turn a characteristic into a dashboard

Cluster 2: Scaling Strategies

Need	Best local chunk	Why
Performance vs scalability	System Design Primer: Performance vs Scalability	Foundation for scale decisions
Horizontal scaling and load balancing	System Design Primer: Load Balancer	The practical foundation for scale-out
Reverse proxies	System Design Primer: Reverse Proxy	Where sticky vs stateless decisions are enforced
Microservices scaling	System Design Primer: Application Layer and Microservices	Why scale-out naturally decomposes tiers
Cache levels	System Design Primer: Cache Overview and Levels	Client, CDN, proxy, app, DB layers
Cache update patterns	System Design Primer: Cache Update Patterns	Cache-aside, write-through, write-behind, refresh-ahead
CDN as the first cache	System Design Primer: Content Delivery Network	Where geographic caching enters the stack
Availability patterns	System Design Primer: Availability Patterns	Fail-over and replication as scale+reliability levers

Cluster 3: Reliability Engineering

Need	Best local chunk	Why
Availability patterns catalog	System Design Primer: Availability Patterns	Active-passive, active-active, replication trade-offs
CAP framing for reliability	System Design Primer: CAP Theorem	The hard constraint behind availability under partitions
Reliability as explicit/implicit characteristic	FoSA: Explicit Characteristics, FoSA: Implicit Characteristics	Why reliability is almost always implicit and therefore under-resourced
Fitness functions	FoSA: Fitness Functions	The general frame for automated characteristic checks - error-budget alerts are an instance
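The last row treats an error-budget alert as one instance of a fitness function. A minimal sketch of that idea, assuming a request-based availability SLO; the function names, the `min_remaining` parameter, and the sample counts are hypothetical and do not come from either book:

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    return 1.0 - failed / allowed_failures


def budget_fitness_check(slo_target: float, total: int, failed: int,
                         min_remaining: float = 0.0) -> bool:
    """Hypothetical automated gate: pass only while budget remains above a floor."""
    return error_budget_remaining(slo_target, total, failed) > min_remaining


# Assumed example: 99.9% SLO, 1,000,000 requests, 250 failures.
# About 1,000 failures are allowed, so roughly 75% of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
gate_ok = budget_fitness_check(0.999, 1_000_000, 250)
```

A check like this could run in a pipeline or on a schedule; where it runs and what it blocks is a team decision, not something this sketch prescribes.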

Cluster 4: Capacity Planning and Load

Need	Best local chunk	Why
Latency vs throughput	System Design Primer: Latency vs Throughput	Starting point for Little's Law intuition
Async/queue patterns	System Design Primer: Asynchronism	Back-pressure, queuing, and decoupling
Load balancer	System Design Primer: Load Balancer	Rate limiting and sticky sessions at the LB
Availability patterns	System Design Primer: Availability Patterns	Where fail-fast and shedding fit
Measurement	FoSA: Measuring Architecture Characteristics	Turning capacity intuition into graphable signals
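The first row points at Little's Law, which relates the mean number of requests in flight (L), the arrival rate (λ), and the mean time each request spends in the system (W) as L = λ × W. A small sketch of how that turns into capacity arithmetic; the function names and the sample figures are illustrative assumptions:

```python
def concurrency_needed(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: mean requests in flight = arrival rate * mean time in system."""
    return arrival_rate_per_s * mean_latency_s


def max_throughput(concurrency_limit: float, mean_latency_s: float) -> float:
    """Little's Law rearranged: sustainable rate for a fixed concurrency limit."""
    if mean_latency_s <= 0:
        raise ValueError("mean_latency_s must be positive")
    return concurrency_limit / mean_latency_s


# Assumed example: 200 requests/s at 250 ms mean latency keeps 50 requests in flight.
in_flight = concurrency_needed(200, 0.25)

# Assumed example: a pool of 64 workers at the same latency sustains 256 requests/s.
ceiling = max_throughput(64, 0.25)
```

The law holds for long-run averages in a stable system; it says nothing about tail latency or bursts, which is why the queueing and shedding rows above still matter.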

Cluster 5: Incident and Observability

Need	Best local chunk	Why
Operations and DevOps framing	FoSA: Operations and DevOps	Why reliability is architecture's partner, not operations' alone
Cross-cutting characteristics	FoSA: Cross-Cutting Architecture Characteristics	Observability as a first-class characteristic
Implicit characteristics (reliability)	FoSA: Implicit Characteristics	Why teams skip reliability planning until too late
Measuring characteristics	FoSA: Measuring Architecture Characteristics	From dashboards to automated gates
Availability patterns	System Design Primer: Availability Patterns	Failover and recovery mechanics that incident responders use

Exercise Support Chunks

Use these when concept pages are understood but fluency is weak:

External Resources (Validated, Read If Pointed Here)

The module links to specific external posts from concept pages. All links were validated at the most recent curation pass.

Reliability (Google SRE)

Google SRE Book (free online) - the canonical operating model. Every reliability concept in this module ties back to a chapter here.
SRE Book: Service Level Objectives - the chapter on SLIs, SLOs, and error budgets.
SRE Book: Embracing Risk - where error budgets come from.
SRE Book: Managing Incidents - incident command protocol.
SRE Book: Emergency Response - response patterns under pressure.
SRE Book: Postmortem Culture - the canonical treatment of blameless postmortems.
The SRE Workbook: Implementing SLOs - the practical guide to SLI/SLO/error-budget rollout.
The SRE Workbook: Alerting on SLOs - multi-window, multi-burn-rate alert math.
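The burn-rate arithmetic behind the last entry can be sketched in a few lines. This is an illustration under stated assumptions, not the Workbook's implementation: the function names are hypothetical, the SLO period is assumed to be 30 days, and the paging rule shown (both a long and a short window must exceed the same threshold) is one common reading of the multi-window approach.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_rate / (1.0 - slo_target)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo_target: float, threshold: float) -> bool:
    """Fire only if both windows burn fast: the long window shows the problem is
    significant, the short window shows it is still happening."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


# Spending 2% of a 30-day budget in 1 hour corresponds to a burn rate of about 14.4.
threshold = burn_rate_threshold(0.02, window_hours=1)

# Assumed example: 99.9% SLO, 2% errors in both windows gives a burn rate near 20.
page_now = should_page(0.02, 0.02, 0.999, threshold)

# Same long window, but the short window has recovered, so no page.
page_recovered = should_page(0.02, 0.001, 0.999, threshold)
```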

Performance

Brendan Gregg: The USE Method - the method's original exposition with checklists for every resource.
Brendan Gregg: Flame Graphs - the visualization that makes CPU profiling tractable.
Jeffrey Dean and Luiz André Barroso: The Tail at Scale (Communications of the ACM, 2013) - the classic paper on why fan-out width amplifies latency tails.
Neil Gunther: Universal Scalability Law - the paper and the software for estimating USL parameters from measurement.
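Two of the entries above reduce to short formulas that are worth trying with numbers. The sketch below assumes independent backend calls for the fan-out case, and uses the standard two-parameter form of the Universal Scalability Law; the function names and the sample values of `alpha` (contention) and `beta` (coherency) are made up for illustration, not measured.

```python
def p_any_slow(fanout: int, p_slow: float) -> float:
    """Chance that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


def usl_relative_capacity(n: int, alpha: float, beta: float) -> float:
    """Universal Scalability Law: C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


# If 1% of calls to one server are slow, a request that fans out to 100 servers
# hits at least one slow call roughly 63% of the time.
single = p_any_slow(1, 0.01)
wide = p_any_slow(100, 0.01)

# With assumed alpha=0.05 and beta=0.001, 10 nodes deliver about 6.5x one node.
ten_nodes = usl_relative_capacity(10, alpha=0.05, beta=0.001)
```

In practice `alpha` and `beta` are fitted from load-test measurements rather than chosen by hand.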

Reliability Engineering (Chaos, Complex Systems)

Principles of Chaos Engineering - the five principles as the canonical definition.
Richard Cook: How Complex Systems Fail (18 principles) - the foundational essay for thinking about large-scale failure.
Netflix: Chaos Monkey and the Simian Army - the practical origin story of chaos engineering in production.
AWS: Fault Injection Service - managed chaos in cloud environments, for reference.

Observability

OpenTelemetry - the current open standard for metrics, logs, and traces. Start at the concepts page.
Google SRE: Monitoring Distributed Systems (Four Golden Signals) - the chapter that defines the four golden signals.

Incident Response

PagerDuty Incident Response (open-source training) - practical IC, severity, and communication playbooks.
Google SRE: Incident Response - the SRE Workbook companion to the book's emergency response chapter.
Etsy: Blameless PostMortems and a Just Culture (John Allspaw) - the canonical industry post on blameless culture.

Use Rules

For every primary concept, the local book chunk is the first escalation. Reach for the SRE Book and the SRE Workbook for reliability, the System Design Primer for scale and caching, and Fundamentals of Software Architecture for the architecture-characteristics view.
Open one chunk per gap. Do not drift into whole chapters unless you are specifically studying for the capstone.
External links are targeted: when a concept page says "external," it means the local chunk does not cover that angle.
