Learning Resources
This module is populated from the local chunked books in library/raw/semester-08-system-design-leadership/books and a short set of validated external resources. Use this page as a source map, not as an instruction to read everything.
Source Stack
| Source | Role | How to use it in this module |
|---|---|---|
| Site Reliability Engineering (Beyer, Jones, Petoff, Murphy - Google) | Primary operating model | Use for every reliability concept (SLI/SLO, error budget, incident management, postmortem, emergency response). Free online at sre.google/sre-book |
| The Site Reliability Workbook (Beyer et al., Google) | Practical companion to the SRE Book | SLO design examples, alerting math, implementing error budgets. Available free at sre.google/workbook |
| System Design Primer (donnemartin, local chunks) | Scalability vocabulary and taxonomy | Cache levels and update patterns, availability patterns, latency numbers |
| Fundamentals of Software Architecture (Richards, Ford), abbreviated FoSA below | Architecture characteristics framework | Naming and rating reliability, scalability, performance, and observability as first-class characteristics; fitness functions as automated checks |
| Building Secure and Reliable Systems (Adkins, Beyer, Blankinship, Lewandowski, Oprea, Stubblefield - Google) | Local support | Use for the overlap between reliability, incident response, security review, and production guardrails |
| Systems Performance (Brendan Gregg) | Selective support for USE | The USE method's original exposition and flame graphs |
| Principles of Chaos Engineering | Selective support for Cluster 3 | The five principles and the hypothesis-driven methodology |
| How Complex Systems Fail (Richard Cook) | Selective support for Cluster 5 | The 18 principles that frame blameless postmortems |
Resource Map by Cluster
Cluster 1: Performance Reasoning
| Need | Best local chunk | Why |
|---|---|---|
| Performance vs scalability distinction | System Design Primer: Performance vs Scalability | The one-page framing of the two quantities |
| Latency vs throughput | System Design Primer: Latency vs Throughput | Compact side-by-side treatment |
| Latency numbers every programmer should know | System Design Primer: Powers of Two and Latency Numbers | Jeff Dean's numbers; essential calibration |
| Scalability characteristic framing | FoSA: Cross-Cutting Architecture Characteristics | Scalability as a first-class trade-off |
| Measurement | FoSA: Measuring Architecture Characteristics | How to turn a characteristic into a dashboard |
Cluster 2: Scaling Strategies
| Need | Best local chunk | Why |
|---|---|---|
| Performance vs scalability | System Design Primer: Performance vs Scalability | Foundation for scale decisions |
| Horizontal scaling and load balancing | System Design Primer: Load Balancer | The practical foundation for scale-out |
| Reverse proxies | System Design Primer: Reverse Proxy | Where sticky vs stateless decisions are enforced |
| Microservices scaling | System Design Primer: Application Layer and Microservices | Why scale-out naturally decomposes tiers |
| Cache levels | System Design Primer: Cache Overview and Levels | Client, CDN, proxy, app, DB layers |
| Cache update patterns | System Design Primer: Cache Update Patterns | Cache-aside, write-through, write-behind, refresh-ahead |
| CDN as the first cache | System Design Primer: Content Delivery Network | Where geographic caching enters the stack |
| Availability patterns | System Design Primer: Availability Patterns | Fail-over and replication as scale+reliability levers |
Cluster 3: Reliability Engineering
| Need | Best local chunk | Why |
|---|---|---|
| Availability patterns catalog | System Design Primer: Availability Patterns | Active-passive, active-active, replication trade-offs |
| CAP framing for reliability | System Design Primer: CAP Theorem | The hard constraint behind availability under partitions |
| Reliability as explicit/implicit characteristic | FoSA: Explicit Characteristics, FoSA: Implicit Characteristics | Why reliability is almost always implicit and therefore under-resourced |
| Fitness functions | FoSA: Fitness Functions | The general frame for automated characteristic checks - error-budget alerts are an instance |
Cluster 4: Capacity Planning and Load
| Need | Best local chunk | Why |
|---|---|---|
| Latency vs throughput | System Design Primer: Latency vs Throughput | Starting point for Little's Law intuition |
| Async/queue patterns | System Design Primer: Asynchronism | Back-pressure, queuing, and decoupling |
| Load balancer | System Design Primer: Load Balancer | Rate limiting and sticky sessions at the LB |
| Availability patterns | System Design Primer: Availability Patterns | Where fail-fast and shedding fit |
| Measurement | FoSA: Measuring Architecture Characteristics | Turning capacity intuition into graphable signals |
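The Little's Law intuition named in the table above can be made concrete with a few lines of arithmetic. This is a minimal sketch, not taken from any of the listed sources: the function names and the example numbers (200 requests/s, 50 ms mean latency, a concurrency limit of 32) are illustrative assumptions.

```python
def concurrency_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the mean number of requests in the system."""
    if arrival_rate_per_s < 0 or mean_latency_s < 0:
        raise ValueError("rate and latency must be non-negative")
    return arrival_rate_per_s * mean_latency_s


def throughput_ceiling(max_concurrency: int, mean_latency_s: float) -> float:
    """Rearranged form: lambda = L / W, the rate a fixed concurrency limit can sustain."""
    if mean_latency_s <= 0:
        raise ValueError("latency must be positive")
    return max_concurrency / mean_latency_s


if __name__ == "__main__":
    # Assumed example: 200 req/s at 50 ms mean latency.
    print(concurrency_in_flight(200, 0.050))
    # Assumed example: a pool of 32 workers at 50 ms mean latency.
    print(throughput_ceiling(32, 0.050))
```

With these assumed inputs, roughly 10 requests are in flight on average, and a 32-slot pool tops out near 640 requests/s. If latency rises under load, the ceiling falls, which is where the queueing and back-pressure chunks become relevant.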
Cluster 5: Incident and Observability
| Need | Best local chunk | Why |
|---|---|---|
| Operations and DevOps framing | FoSA: Operations and DevOps | Why reliability is architecture's partner, not operations' alone |
| Cross-cutting characteristics | FoSA: Cross-Cutting Architecture Characteristics | Observability as a first-class characteristic |
| Implicit characteristics (reliability) | FoSA: Implicit Characteristics | Why teams skip reliability planning until too late |
| Measuring characteristics | FoSA: Measuring Architecture Characteristics | From dashboards to automated gates |
| Availability patterns | System Design Primer: Availability Patterns | Failover and recovery mechanics that incident responders use |
Exercise Support Chunks
Use these when you understand the concept pages but still lack fluency:
- System Design Primer: Performance vs Scalability
- System Design Primer: Latency vs Throughput
- System Design Primer: Cache Overview and Levels
- System Design Primer: Cache Update Patterns
- System Design Primer: Availability Patterns
External Resources (Validated, Read If Pointed Here)
The module links to specific external posts from concept pages. All were validated at the most recent curation pass.
Reliability (Google SRE)
- Google SRE Book (free online) - the canonical operating model. Every reliability concept in this module ties back to a chapter here.
- SRE Book: Service Level Objectives - the chapter on SLIs, SLOs, and error budgets.
- SRE Book: Embracing Risk - where error budgets come from.
- SRE Book: Managing Incidents - incident command protocol.
- SRE Book: Emergency Response - response patterns under pressure.
- SRE Book: Postmortem Culture - the canonical treatment of blameless postmortems.
- The SRE Workbook: Implementing SLOs - the practical guide to SLI/SLO/error-budget rollout.
- The SRE Workbook: Alerting on SLOs - multi-window, multi-burn-rate alert math.
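The error-budget and burn-rate arithmetic that these chapters build on is short enough to sketch. The following is a hedged illustration, not the Workbook's code: the function names are hypothetical, the 30-day period is an assumption, and the 14.4 threshold is the Workbook's example of a burn rate that consumes 2% of a 30-day budget in one hour.

```python
def error_budget(slo: float) -> float:
    """Fraction of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / error_budget(slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the period's budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows it
    is still happening, so the alert resets soon after recovery.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    # Allowed "bad minutes" in an assumed 30-day period.
    print(error_budget(slo) * 30 * 24 * 60)
    # Assumed observations: 2% errors over the last hour and the last 5 minutes.
    print(should_page(0.02, 0.02, slo))
```

For a 99.9% SLO this gives about 43 minutes of budget per 30 days. Requiring both windows to exceed the threshold is the design choice that keeps a brief spike, or a long-resolved incident, from paging.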
Performance
- Brendan Gregg: The USE Method - the method's original exposition with checklists for every resource.
- Brendan Gregg: Flame Graphs - the visualization that makes CPU profiling tractable.
- Jeffrey Dean and Luiz André Barroso: The Tail at Scale (Communications of the ACM, 2013) - the classic on why fan-out width amplifies latency tails.
- Neil Gunther: Universal Scalability Law - the paper and the software for estimating USL parameters from measurement.
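Two formulas from the last two entries above are easy to try numerically. This sketch assumes independent backend latencies for the fan-out calculation and uses the common three-parameter form of the Universal Scalability Law; the function names and the sample coefficients are illustrative assumptions, not fitted values from any source.

```python
import math


def p_any_slow(p_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    if not 0.0 <= p_slow <= 1.0 or fanout < 0:
        raise ValueError("p_slow must be in [0, 1] and fanout non-negative")
    return 1.0 - (1.0 - p_slow) ** fanout


def usl_throughput(n: float, lam: float, sigma: float, kappa: float) -> float:
    """USL: X(N) = lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1)).

    lam   - throughput of a single worker
    sigma - contention (serialization) coefficient
    kappa - coherency (crosstalk) coefficient
    """
    return lam * n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma: float, kappa: float) -> float:
    """N at which USL throughput peaks: sqrt((1 - sigma) / kappa)."""
    if kappa <= 0:
        raise ValueError("kappa must be positive for a finite peak")
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    # A request that waits on 100 backends, each slow 1% of the time.
    print(round(p_any_slow(0.01, 100), 3))
    # Assumed coefficients, for illustration only.
    print(round(usl_peak_concurrency(0.05, 0.0001), 1))
```

With a 1% per-call slow rate, a 100-way fan-out is slow on roughly 63% of requests, which is the amplification effect the paper describes. Past the USL peak, adding workers reduces throughput, so the peak is a useful bound when planning scale-out.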
Reliability Engineering (Chaos, Complex Systems)
- Principles of Chaos Engineering - the five principles as the canonical definition.
- Richard Cook: How Complex Systems Fail (18 principles) - the foundational essay for thinking about large-scale failure.
- Netflix: Chaos Monkey and the Simian Army - the practical origin story of chaos engineering in production.
- AWS: Fault Injection Service - managed chaos in cloud environments, for reference.
Observability
- OpenTelemetry - the current open standard for metrics, logs, and traces. Start at the concepts page.
- Google SRE: Monitoring Distributed Systems (Four Golden Signals) - the chapter that defines the four golden signals.
Incident Response
- PagerDuty Incident Response (open-source training) - practical IC, severity, and communication playbooks.
- Google SRE: Incident Response - the SRE Workbook companion to the book's emergency response chapter.
- Etsy: Blameless PostMortems and a Just Culture (John Allspaw) - the canonical industry post on blameless culture.
Use Rules
- For every primary concept, the local book chunk is the first escalation. Reach for the SRE Book and the SRE Workbook for reliability, the System Design Primer for scale and caching, and Fundamentals of Software Architecture for the architecture-characteristics view.
- Open one chunk per gap. Do not drift into whole chapters unless you are specifically studying for the capstone.
- External links are targeted: when a concept page says "external," it means the local chunk does not cover that angle.