Module 4: Scale, Reliability & Performance
Primary texts: Site Reliability Engineering (Beyer et al., Google) and The Site Reliability Workbook as the operating model; System Design Primer for the scalability vocabulary; Fundamentals of Software Architecture (Richards, Ford) for architecture characteristics and fitness functions
Selective support: Brendan Gregg's Systems Performance for the USE method; Richard Cook's How Complex Systems Fail; Principles of Chaos (principlesofchaos.org); Neil Gunther's Universal Scalability Law papers; the local system-design-primer chunks for latency numbers, caching levels, and availability patterns
This module is where "my service works on one box" becomes "my service survives a Black Friday spike, one bad deploy, and a regional outage without waking anyone up at 3am who shouldn't be awake." You already know distributed-systems fundamentals (Semester 6, Module 5) and replication and partitioning (Semester 6, Module 3). Here you learn the operating discipline that makes those designs stay up under real load and real failure.
Scope of This Module
This module is not "a catalog of scalability tricks." It is where reliability stops being a vibe and becomes a number, where performance stops being a benchmark and becomes a distribution, and where incident response stops being heroics and becomes a lifecycle.
What it covers in depth:
- latency, throughput, and utilization as three different quantities, and the USE, RED, and Four Golden Signals families that monitor them
- why averages hide outages and why the entire production dialect of "performance" is written in percentiles
- Amdahl's Law and the Universal Scalability Law, and why throwing machines at a problem has a ceiling
- vertical versus horizontal scaling and why "stateless" is the word that separates the two worlds
- statelessness, sticky sessions, and external session stores
- caching at every layer of the stack, cache update patterns, and why the "there are only two hard problems in CS" joke counts cache invalidation as one of them
- Service Level Indicators, Service Level Objectives, and error budgets as the contract between reliability and velocity
- cascading, correlated, and gray failures - the three failure patterns that defeat naive redundancy
- chaos engineering and game days as the empirical discipline that replaces "we tested it in staging"
- Little's Law, queuing intuition, and back-pressure as the math behind overload
- load shedding, rate limiting, and admission control as the levers you actually pull under overload
- capacity planning with headroom, peaks, and growth curves
- the three pillars of observability: metrics, logs, traces
- the incident lifecycle: detect, triage, mitigate, resolve, review
- blameless postmortems and the learning systems that turn one outage into structural improvements
What it deliberately does not try to finish here:
- a full performance-tuning curriculum (flame graphs, kernel tracing, perf, eBPF)
- database-specific tuning beyond what Semester 6 covered
- a complete incident-command playbook (that belongs in Module 5 leadership context)
- SLO math in its deepest form (multi-window, multi-burn-rate alerts get referenced but not exhaustively modeled)
If you leave this module still saying "we'll scale it when we need to" or "the average response time looks fine," the module is not complete.
Before You Start
Answer these closed-book before starting the main path:
- Your service's average latency is 120ms. Is that a good number? Why is the answer almost always "I don't have enough information"?
- Your single-instance service handles 1,000 req/s. You add nine more instances behind a load balancer. What is the upper bound on new throughput, and why is it almost never 10,000?
- A bug causes one microservice to time out. The other six services that depend on it start returning 500s. What is this pattern called and what protects against it?
- Your team's SLO is "99.9% availability per quarter." Over 90 days, how much downtime does the error budget permit?
- It's 3am and the pager just went off. What is the very first thing you do - and why is it not "open the code"?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path.
2-3 solid answers
- Continue, but expect extra time in Cluster 3 (Reliability Engineering) and Cluster 4 (Capacity Planning and Load).
0-1 solid answers
- Revisit Semester 6 Module 5 (Distributed Systems Fundamentals) briefly. Reliability work here only makes sense on top of a concrete mental model of partial failure and retries.
What This Module Is For
Scale and reliability are the two places single-machine engineering intuitions quietly go to die. Later work in your career will repeatedly ask:
- my dashboards are green but users are complaining - what is wrong with my dashboards?
- my service is at 40% CPU but request latency just doubled - what else is saturated?
- my backend started timing out and now my frontend is also down - why did the blast radius grow?
- my retry storm is making the outage worse - how do I shed load instead of amplifying it?
- my SLO is 99.9% and we burned half the error budget in three days - do we ship the feature or freeze?
This module builds the reasoning needed for:
- production on-call work where you must reason about tail latency, not averages
- architecture reviews where "what happens at 10x load" is an actual answer, not a shrug
- SLO-driven development, error budgets, and the engineering-product negotiation they enable
- postmortem authoring and the cultural discipline that makes learning systematic
- the Semester 8 capstone and Module 5 leadership work
You are learning to reason about load and failure without handwaving.
Concept Map
How To Use This Module
Work in order. Later clusters only make sense if the earlier performance vocabulary is stable.
Cluster 1: Performance Reasoning
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals | PRIMARY | The three distinct quantities every engineer must separate, and the three framings that organize every dashboard you will ever build |
| 2 | Percentile Latency and Why Averages Lie | PRIMARY | p50, p95, p99, p999, tail amplification, and the worked example where two systems with identical averages have very different user experience |
| 3 | Amdahl's Law and the Universal Scalability Law: The Limits of Parallelism | PRIMARY | Why linear scale-out is impossible, where the ceiling is, and why more machines can sometimes make things slower |
Cluster mastery check: Given a latency histogram, can you distinguish p50, p95, p99 in words; estimate which golden signal would alert first under overload; and name the USL coefficient that makes scaling go backwards?
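As a warm-up for the percentile part of this check, here is a minimal sketch of the two-systems-with-equal-mean comparison. The latency samples are synthetic and the helper names (`percentile`, `mean`) are illustrative assumptions; the percentile uses the simple nearest-rank definition, which is only one of several conventions.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # ceil(p/100 * n) computed with integers to avoid float rounding surprises
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def mean(samples):
    return sum(samples) / len(samples)


# Synthetic data (assumed): 100 requests per system, latencies in ms.
system_a = [120] * 100              # every request takes 120 ms
system_b = [100] * 98 + [1100] * 2  # 98 fast requests, 2 very slow ones

for name, data in (("A", system_a), ("B", system_b)):
    print(name, "mean", mean(data),
          "p50", percentile(data, 50),
          "p95", percentile(data, 95),
          "p99", percentile(data, 99))
```

Both systems report a mean of 120 ms, yet system B's p99 is 1100 ms. A dashboard that only charts the mean cannot tell them apart.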
Cluster 2: Scaling Strategies
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | Vertical vs Horizontal Scaling: The When and Why | PRIMARY | Scale up vs scale out, cost curves, operational consequences, and the failure modes that determine the choice |
| 5 | Statelessness, Sticky Sessions, and Session Stores | PRIMARY | Why stateless is the precondition for cheap horizontal scaling, and the three places session state can live |
| 6 | Caching Strategies at Every Layer | PRIMARY | Client, CDN, reverse proxy, application, and database caches; cache-aside, write-through, write-behind, refresh-ahead; invalidation as the hard part |
Cluster mastery check: Given a synchronous request-response service at 1k req/s needing to grow to 50k, can you name at least five caching-or-stateless changes you would make, in order of impact, and the new failure modes each introduces?
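One of the caching changes you might name here is cache-aside. The sketch below is a hedged, in-process illustration only: the class name `CacheAside`, the `load_from_store` callback, and the TTL default are all assumptions made for the example, and a production cache would usually live in an external store.

```python
import time


class CacheAside:
    """Cache-aside (lazy loading): the caller checks the cache first and,
    on a miss, reads the backing store and populates the cache itself."""

    def __init__(self, load_from_store, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_store   # hypothetical loader, e.g. a database read
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so tests can control time
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]            # hit: the store is not touched
        value = self._load(key)        # miss or expired: go to the store
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the store so the next read reloads."""
        self._entries.pop(key, None)
```

The new failure modes are visible in the code: a write that forgets to call `invalidate` serves stale data until the TTL expires, and nothing here stops many concurrent misses on the same key from all hitting the store at once (a cache stampede).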
Cluster 3: Reliability Engineering
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | SLIs, SLOs, and Error Budgets | PRIMARY | Indicators, objectives, the error-budget math, and the negotiation the budget enables between reliability and velocity |
| 8 | Failure Modes: Cascading, Correlated, and Gray | PRIMARY | The three patterns that defeat naive redundancy, plus the concrete mitigations: timeouts, circuit breakers, bulkheads, and hedged requests |
| 9 | Chaos Engineering and Game Days | PRIMARY | Experimentation as reliability evidence, the four principles of chaos, the minimum-viable game day, and why staging is not an answer |
Cluster mastery check: Given a 99.9% SLO whose error budget is being burned within 30 days, can you write the error-budget policy, identify one cascading and one gray failure your service is exposed to, and design a one-hour game day that would test one of them?
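The error-budget arithmetic behind this check can be sketched in a few lines. The function names are illustrative, the 30-day window and the 1% observed failure fraction are assumed inputs, and the model treats availability as a simple time or request fraction.

```python
def error_budget_minutes(slo, window_days):
    """Downtime (in minutes) an availability SLO permits over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_bad_fraction, slo):
    """How many times faster than 'exactly on budget' the budget is spent.
    A burn rate of 1 exhausts the budget exactly at the end of the window."""
    return observed_bad_fraction / (1 - slo)


def days_to_exhaustion(window_days, rate):
    """Days until the budget is gone if the current burn rate holds."""
    return window_days / rate


budget = error_budget_minutes(0.999, 30)   # roughly 43.2 minutes
rate = burn_rate(0.01, 0.999)              # 1% failures against a 0.1% allowance
print(round(budget, 1), round(rate, 1), round(days_to_exhaustion(30, rate), 1))
```

With these assumed inputs the service is failing ten times faster than the SLO allows, so a 30-day budget is gone in about three days. That is the kind of number an error-budget policy turns into a freeze rule.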
Cluster 4: Capacity Planning and Load
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | Back-Pressure, Queuing Theory Basics, and Little's Law | PRIMARY | L = lambda * W, why utilization past 80% is a latency cliff, and what back-pressure actually sends upstream |
| 11 | Load Shedding, Rate Limiting, and Admission Control | PRIMARY | Token bucket, leaky bucket, concurrency limits, adaptive shedding, and the cost of work accepted but not served |
| 12 | Capacity Planning: Headroom, Peaks, and Growth Modeling | PRIMARY | Peak-to-mean ratios, headroom targets, growth extrapolation, and how to produce a defensible capacity number you do not need to apologize for |
Cluster mastery check: For a service with mean 200 req/s and peak 1,500 req/s, can you estimate Little's Law quantities, pick a rate-limit, set a headroom target, and produce a six-month capacity forecast from first principles?
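A hedged sketch of the Little's Law part of this check follows. The 50 ms time-in-system is an assumption (the check does not give one), and the latency-versus-utilization helper uses the textbook M/M/1 formula, which real services only approximate.

```python
def in_flight(arrival_rate_per_s, time_in_system_s):
    """Little's Law: L = lambda * W (average requests in the system)."""
    return arrival_rate_per_s * time_in_system_s


def mm1_time_in_system(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).
    Assumes Poisson arrivals, exponential service times, and one server."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1 - utilization)


ASSUMED_W = 0.05  # seconds per request; illustrative, not from the module

print("mean load in flight:", round(in_flight(200, ASSUMED_W), 1))
print("peak load in flight:", round(in_flight(1500, ASSUMED_W), 1))

for rho in (0.5, 0.8, 0.9, 0.95):
    w_ms = mm1_time_in_system(ASSUMED_W, rho) * 1000
    print(f"utilization {rho:.2f} -> about {w_ms:.0f} ms in system")
```

Under the M/M/1 assumption, time in system is 2x the service time at 50% utilization, 5x at 80%, and 20x at 95%, which is why the curve past 0.8 is described as a cliff.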
Cluster 5: Incident and Observability
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Observability Three Pillars: Metrics, Logs, Traces | PRIMARY | What each pillar is cheap at, what each is expensive at, the known-vs-unknown-unknowns framing, and why one pillar alone is not observability |
| 14 | Incident Lifecycle: Detect, Triage, Mitigate, Resolve, Review | PRIMARY | The five phases, the two clocks that matter (MTTD and MTTR), the Incident Commander role, and why "fix it before you understand it" is the right answer |
| 15 | Blameless Postmortems and Learning from Incidents | SUPPORTING | The postmortem template, the blameless contract, the difference between cause and contributing factor, and why the action-item list is where most postmortems fail |
Cluster mastery check: Given a one-hour partial outage traced to a slow dependency and a misconfigured timeout, can you walk through the full lifecycle from page to postmortem, and distinguish the actions that should block the review from those that should not?
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | Performance Profiling Lab | USE method walkthrough, percentile histograms by hand, Amdahl speedup calculation |
| 2 | Scaling Design Workshop | Take a monolithic service from 1k to 50k req/s by design; defend each choice |
| 3 | Reliability and SLO Clinic | Write SLIs and SLOs for two real APIs; compute error budget burn; design an error-budget policy |
| 4 | Scale and Reliability Katas | Write an SLI/SLO for a real API; design a rate-limiter; apply Little's Law to a real queue; run a tabletop incident |
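For the rate-limiter kata, one possible starting point is a token bucket. This is a minimal single-threaded sketch; the class name, parameters, and injectable clock are assumptions for illustration, and a shared production limiter would also need locking or an external store.

```python
import time


class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a fixed capacity.
    Each request spends tokens; a request is rejected when too few remain."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s           # sustained requests per second
        self.capacity = capacity         # maximum burst size
        self._clock = clock              # injectable so tests can control time
        self._tokens = float(capacity)   # start full so an initial burst passes
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill for the elapsed time, capped at the bucket capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                     # caller should reject, e.g. with HTTP 429
```

The two parameters separate the sustained rate from the tolerated burst, which is the design decision the kata asks you to defend: a large `capacity` is friendlier to bursty clients but lets more work through at the start of an overload.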
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Distinguish latency, throughput, and utilization, and name a concrete metric and alert for each under USE, RED, and the Four Golden Signals.
- Read and explain a latency histogram in percentiles, including p50, p95, p99, and p999, and argue why an average can be green while p99 is already violating the SLO.
- State Amdahl's Law and the Universal Scalability Law in one line each, explain the coherence penalty, and compute where the scaling curve bends for a given serial fraction.
- Choose between vertical and horizontal scaling for a given workload with an explicit reason grounded in cost, failure mode, and operational load.
- Design a stateless tier and a session store that does not reintroduce a scaling bottleneck, and name the three places state can still creep in.
- Pick a caching pattern (cache-aside, write-through, write-behind, refresh-ahead) for a given read/write ratio and staleness tolerance, and name the invalidation trap each pattern hides.
- Write an SLI as a ratio of good-events to valid-events, an SLO as a numeric objective over a window, and compute the error budget and its burn rate.
- Name the three failure-mode families (cascading, correlated, gray), give a concrete example of each, and list the mitigation for each.
- Design a one-hour game day for a specific service: hypothesis, blast-radius control, abort conditions, and the evidence it would produce.
- Apply Little's Law to a real queue (L = lambda * W), predict what happens as utilization crosses 0.8, and explain what back-pressure does that retries do not.
- Distinguish rate limiting, load shedding, and admission control; pick one for a given overload scenario and name the failure mode each prevents.
- Produce a capacity number from data: mean, peak, peak-to-mean ratio, headroom target, and a growth extrapolation that you would defend to a skeptical reviewer.
- Explain what each observability pillar is cheap vs expensive at, and justify the decision of when to instrument with metrics, logs, or traces.
- Walk through the incident lifecycle for a real outage, including the Incident Commander role, the two clocks (MTTD and MTTR), and the mitigation-before-understanding principle.
- Author a blameless postmortem that distinguishes cause from contributing factor, produces action items that are specific and dated, and does not name and shame.
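The capacity-number objective in the list above lends itself to a small worked calculation. The sketch below assumes one particular definition of headroom (peak traffic may use at most `1 - headroom` of provisioned capacity), compound monthly growth, and a made-up per-instance throughput of 250 req/s; all function names and input values are illustrative.

```python
import math


def peak_to_mean(mean_rps, peak_rps):
    return peak_rps / mean_rps


def projected_peak(peak_rps, monthly_growth, months):
    """Compound growth extrapolation; monthly_growth is a fraction, e.g. 0.05."""
    return peak_rps * (1 + monthly_growth) ** months


def required_capacity(peak_rps, headroom):
    """Capacity such that the peak uses at most (1 - headroom) of it."""
    return peak_rps / (1 - headroom)


def instances_needed(capacity_rps, per_instance_rps):
    return math.ceil(capacity_rps / per_instance_rps)


# Assumed inputs: mean 200 req/s, peak 1,500 req/s, 30% headroom,
# six-month horizon, 250 req/s per instance (hypothetical benchmark result).
scenarios = {"nominal": 0.05, "worst-case": 0.10}
print("peak-to-mean ratio:", peak_to_mean(200, 1500))
for name, growth in scenarios.items():
    peak = projected_peak(1500, growth, 6)
    capacity = required_capacity(peak, 0.30)
    print(name, round(peak), round(capacity), instances_needed(capacity, 250))
```

Every input is an explicit argument, so a skeptical reviewer can challenge the growth rate or the headroom target directly instead of arguing with the final number.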
Outputs
- one USE / RED / Golden Signals dashboard sketch for a real or hypothetical service, with one metric per cell and one alert per row
- one p50/p95/p99/p999 histogram worked out by hand from synthetic data, including the two-systems-with-equal-mean example
- one Amdahl / USL worksheet computing the scaling ceiling for a specific serial fraction and a specific coherence penalty
- one two-page design memo scaling a monolithic service to 10x load using vertical, horizontal, stateless, and caching changes
- one SLI + SLO + error-budget policy document for a real API that you use, including burn-rate alerts and a freeze rule
- one chaos / game-day plan with hypothesis, blast radius, abort condition, and evidence list
- one Little's Law + rate-limit worksheet: given mean, peak, and service time, compute queue depth, pick a rate limit, and predict the failure mode of each choice
- one capacity plan: six-month extrapolation with stated assumptions and at least two scenarios (nominal and worst-case)
- one full incident writeup analyzing a public postmortem using the module vocabulary (cascading, gray, SLO burn, MTTD, MTTR, contributing factors)
- one mistake log with at least 10 entries tagged averaged-instead-of-percentiled, no-fencing-on-cache-stampede, shared-failure-domain, retry-without-backoff, alert-on-cause-not-symptom, etc.
Completion Standard
You have completed Module 4 when all of these are true:
- you no longer talk about "average latency" in production contexts without being asked
- you can state Little's Law and apply it to a real queue without looking up the formula
- you can write an SLI and SLO that a product manager and an SRE would both accept
- you can distinguish cascading, correlated, and gray failures in a fresh incident narrative
- you can describe the incident lifecycle and name what an Incident Commander does that a responder does not
- you have authored at least one blameless postmortem and can describe why naming a person in a cause statement is a bug in the writeup
- you can design a chaos experiment with an explicit abort condition, and explain why "we tested it in staging" is insufficient
- you can defend a capacity number with a peak-to-mean ratio, a headroom target, and a growth model
If you are still saying "just add more machines" without naming the USL coefficient you are about to activate, the module is not complete.
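To make "the USL coefficient you are about to activate" concrete, here is a hedged numeric sketch of Amdahl's Law and the Universal Scalability Law. The coefficient values (10% serial fraction, contention `alpha = 0.05`, coherency `beta = 0.001`) are invented for illustration; real values come from fitting measured throughput.

```python
import math


def amdahl_speedup(serial_fraction, n):
    """Amdahl's Law: speedup on n workers; the ceiling is 1 / serial_fraction."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n)


def usl_capacity(n, alpha, beta):
    """Universal Scalability Law: relative capacity at n nodes.
    alpha models contention (queueing on shared resources);
    beta models coherency (pairwise crosstalk), the term that makes
    throughput fall once n grows past the peak."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_n(alpha, beta):
    """Node count at which USL capacity is highest: sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)


print("Amdahl, 10% serial, 10 workers:", round(amdahl_speedup(0.10, 10), 2))
print("USL peak at about", round(usl_peak_n(0.05, 0.001), 1), "nodes")
for n in (1, 10, 31, 60):
    print(n, "nodes ->", round(usl_capacity(n, 0.05, 0.001), 2), "x one node")
```

With these assumed coefficients, ten nodes deliver roughly 6.5 times the capacity of one rather than 10 times, and capacity starts to decline beyond about 31 nodes. The `beta` term is the one responsible for the decline.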
Reading Policy
- Concept pages are the main path.
- Local book chunks (system-design-primer, Fundamentals of Software Architecture) are selective reinforcement, not a second syllabus. "Read only if stuck" means try the concept page, self-check, and drill first.
- External validated links (sre.google, brendangregg.com, how.complexsystems.fail, principlesofchaos.org) are targeted; read them when the concept page points to them.
- Because this module is the operating discipline for every distributed design you ship after it, hand-drawn histograms, hand-computed error budgets, and written postmortems are required, not optional.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3; sketch a USE/RED/Golden Signals dashboard for a service you know |
| 2 | Concepts 4-6; scale-out design memo for a monolithic example |
| 3 | Concepts 7-9; write one SLI/SLO for a real API |
| 4 | Concepts 10-12; Little's Law worksheet and capacity plan for a workload you have data for |
| 5 | Concepts 13-15 and Practice 1 (performance profiling lab) |
| 6 | Practice 2 (scaling design workshop) |
| 7 | Practice 3 (reliability and SLO clinic) |
| 8 | Practice 4 (katas), quiz, mistake-log cleanup |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread