Learning Resources

This module is populated from the local chunked books in library/raw/semester-08-system-design-leadership/books and a short set of validated external resources. Use this page as a source map, not as an instruction to read everything.

Source Stack

Book | Role | How to use it in this module
Site Reliability Engineering (Beyer, Jones, Petoff, Murphy - Google) | Primary operating model | Every reliability concept (SLI/SLO, error budget, incident management, postmortem, emergency response). Available free online at sre.google/sre-book
The Site Reliability Workbook (Beyer et al., Google) | Practical companion to the SRE Book | SLO design examples, alerting math, implementing error budgets. Available free at sre.google/workbook
System Design Primer (donnemartin, local chunks) | Scalability vocabulary and taxonomy | Cache levels and update patterns, availability patterns, latency numbers
Fundamentals of Software Architecture (Richards, Ford), abbreviated FoSA below | Architecture characteristics framework | Naming and rating reliability, scalability, performance, observability as first-class characteristics; fitness functions as automated checks
Building Secure and Reliable Systems | Local support | Use for the overlap between reliability, incident response, security review, and production guardrails
Systems Performance (Brendan Gregg) | Selective support for USE | The USE method's original exposition and flame graphs
Principles of Chaos Engineering | Selective support for Cluster 3 | The five principles and the hypothesis-driven methodology
How Complex Systems Fail (Richard Cook) | Selective support for Cluster 5 | The 18 principles that frame blameless postmortems

Resource Map by Cluster

Cluster 1: Performance Reasoning

Need | Best local chunk | Why
Performance vs scalability distinction | System Design Primer: Performance vs Scalability | The one-page framing of the two quantities
Latency vs throughput | System Design Primer: Latency vs Throughput | Compact side-by-side treatment
Latency numbers every engineer knows | System Design Primer: Powers of Two and Latency Numbers | Jeff Dean's numbers; essential calibration
Scalability characteristic framing | FoSA: Cross-Cutting Architecture Characteristics | Scalability as a first-class trade-off
Measurement | FoSA: Measuring Architecture Characteristics | How to turn a characteristic into a dashboard

Cluster 2: Scaling Strategies

Need | Best local chunk | Why
Performance vs scalability | System Design Primer: Performance vs Scalability | Foundation for scale decisions
Horizontal scaling and load balancing | System Design Primer: Load Balancer | The practical foundation for scale-out
Reverse proxies | System Design Primer: Reverse Proxy | Where sticky vs stateless decisions are enforced
Microservices scaling | System Design Primer: Application Layer and Microservices | Why scale-out naturally decomposes tiers
Cache levels | System Design Primer: Cache Overview and Levels | Client, CDN, proxy, app, DB layers
Cache update patterns | System Design Primer: Cache Update Patterns | Cache-aside, write-through, write-behind, refresh-ahead
CDN as the first cache | System Design Primer: Content Delivery Network | Where geographic caching enters the stack
Availability patterns | System Design Primer: Availability Patterns | Fail-over and replication as scale and reliability levers

Cluster 3: Reliability Engineering

Need | Best local chunk | Why
Availability patterns catalog | System Design Primer: Availability Patterns | Active-passive, active-active, replication trade-offs
CAP framing for reliability | System Design Primer: CAP Theorem | The hard constraint behind availability under partitions
Reliability as explicit/implicit characteristic | FoSA: Explicit Characteristics, FoSA: Implicit Characteristics | Why reliability is almost always implicit and therefore under-resourced
Fitness functions | FoSA: Fitness Functions | The general frame for automated characteristic checks - error-budget alerts are an instance
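
The fitness-functions entry above treats an error-budget alert as one instance of an automated characteristic check. A minimal sketch of that idea follows, assuming a request-based availability SLO. The function names, the 99.9% target, and the request counts are illustrative assumptions and are not taken from the SRE Book or FoSA.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no budget has been used; 0.0 or below means it is exhausted.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def budget_fitness_check(slo_target: float, total_requests: int,
                         failed_requests: int,
                         min_remaining: float = 0.0) -> bool:
    """Pass (True) while at least min_remaining of the budget is left."""
    remaining = error_budget_remaining(slo_target, total_requests,
                                       failed_requests)
    return remaining >= min_remaining


if __name__ == "__main__":
    # Hypothetical 30-day window: 1,000,000 requests, 250 failures.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    print(budget_fitness_check(0.999, 1_000_000, 1_500))
```

A check like this could run in a deployment pipeline or on a schedule. Production alerting usually looks at burn rate over several windows, which the SRE Workbook covers under alerting math.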

Cluster 4: Capacity Planning and Load

Need | Best local chunk | Why
Latency vs throughput | System Design Primer: Latency vs Throughput | Starting point for Little's Law intuition
Async/queue patterns | System Design Primer: Asynchronism | Back-pressure, queuing, and decoupling
Load balancer | System Design Primer: Load Balancer | Rate limiting and sticky sessions at the LB
Availability patterns | System Design Primer: Availability Patterns | Where fail-fast and shedding fit
Measurement | FoSA: Measuring Architecture Characteristics | Turning capacity intuition into graphable signals
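
The Little's Law intuition mentioned above can be made concrete with a short calculation: in a stable system, mean requests in flight equals arrival rate times mean time in the system (L = λW). The sketch below is a rough sizing aid under that assumption. The function names, the per-instance slot count, and the headroom fraction are hypothetical parameters chosen for illustration.

```python
import math


def requests_in_flight(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the mean number of concurrent requests."""
    return arrival_rate_per_s * mean_latency_s


def instances_needed(arrival_rate_per_s: float, mean_latency_s: float,
                     slots_per_instance: int, headroom: float = 0.25) -> int:
    """Estimate an instance count, keeping a fraction of each instance idle.

    headroom is the share of every instance's concurrency slots reserved
    for bursts, so only (1 - headroom) of the slots count as usable.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_slots = slots_per_instance * (1.0 - headroom)
    in_flight = requests_in_flight(arrival_rate_per_s, mean_latency_s)
    return math.ceil(in_flight / usable_slots)


if __name__ == "__main__":
    # Hypothetical service: 2000 requests/s at 50 ms mean latency,
    # 16 concurrency slots per instance, 25% of slots held back.
    print(requests_in_flight(2000, 0.05))
    print(instances_needed(2000, 0.05, slots_per_instance=16))
```

The estimate uses mean latency only. Tail latency and bursty arrivals push real requirements higher, which is why the headroom parameter exists and why the measurement chunk matters for validating the number.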

Cluster 5: Incident and Observability

Need | Best local chunk | Why
Operations and DevOps framing | FoSA: Operations and DevOps | Why reliability is architecture's partner, not operations' alone
Cross-cutting characteristics | FoSA: Cross-Cutting Architecture Characteristics | Observability as a first-class characteristic
Implicit characteristics (reliability) | FoSA: Implicit Characteristics | Why teams skip reliability planning until too late
Measuring characteristics | FoSA: Measuring Architecture Characteristics | From dashboards to automated gates
Availability patterns | System Design Primer: Availability Patterns | Failover and recovery mechanics that incident responders use

Exercise Support Chunks

Use these when concept pages are understood but fluency is weak:

External Resources (Validated, Read If Pointed Here)

The module links to specific external posts from concept pages. All of them were validated at the most recent curation pass.

Reliability (Google SRE)

Performance

Reliability Engineering (Chaos, Complex Systems)

Observability

Incident Response

Use Rules

  • For every primary concept, the local book chunk is the first escalation. Reach for the SRE Book and the SRE Workbook for reliability, the System Design Primer for scale and caching, and Fundamentals of Software Architecture for the architecture-characteristics view.
  • Open one chunk per gap. Do not drift into whole chapters unless you are specifically studying for the capstone.
  • External links are targeted: when a concept page marks a resource as "external," the local chunks do not cover that angle.