Module 4: Scale, Reliability & Performance

Primary texts: The Site Reliability Engineering Book (Beyer et al., Google) and The Site Reliability Workbook as the operating model; System Design Primer for the scalability vocabulary; Fundamentals of Software Architecture (Richards, Ford) for architecture characteristics and fitness functions.

Selective support: Brendan Gregg's Systems Performance for the USE method; Richard Cook's How Complex Systems Fail; Principles of Chaos (principlesofchaos.org); Neil Gunther's Universal Scalability Law papers; the local system-design-primer chunks for latency numbers, caching levels, and availability patterns.

This module is where "my service works on one box" becomes "my service survives a Black Friday spike, one bad deploy, and a regional outage without paging anyone at 3am who does not need to be awake." You already know distributed-systems fundamentals (Semester 6, Module 5) and replication and partitioning (Semester 6, Module 3). Here you learn the operating discipline that keeps those designs up under real load and real failure.

Scope of This Module

This module is not "a catalog of scalability tricks." It is where reliability stops being a vibe and becomes a number, where performance stops being a benchmark and becomes a distribution, and where incident response stops being heroics and becomes a lifecycle.

What it covers in depth:

latency, throughput, and utilization as three different quantities, and the USE, RED, and Four Golden Signals families that monitor them
why averages hide outages and why the entire production dialect of "performance" is written in percentiles
Amdahl's Law and the Universal Scalability Law, and why throwing machines at a problem has a ceiling
vertical versus horizontal scaling and why "stateless" is the word that separates the two worlds
statelessness, sticky sessions, and external session stores
caching at every layer of the stack, cache update patterns, and why "there are only two hard problems in CS" is about this
Service Level Indicators, Service Level Objectives, and error budgets as the contract between reliability and velocity
cascading, correlated, and gray failures - the three failure patterns that defeat naive redundancy
chaos engineering and game days as the empirical discipline that replaces "we tested it in staging"
Little's Law, queuing intuition, and back-pressure as the math behind overload
load shedding, rate limiting, and admission control as the levers you actually pull under overload
capacity planning with headroom, peaks, and growth curves
the three pillars of observability: metrics, logs, traces
the incident lifecycle: detect, triage, mitigate, resolve, review
blameless postmortems and the learning systems that turn one outage into structural improvements

What it deliberately does not try to finish here:

a full performance-tuning curriculum (flame graphs, kernel tracing, perf, eBPF)
database-specific tuning beyond what Semester 6 covered
a complete incident-command playbook (that belongs in Module 5 leadership context)
SLO math in its deepest form (multi-window, multi-burn-rate alerts get referenced but not exhaustively modeled)

If you leave this module still saying "we'll scale it when we need to" or "the average response time looks fine," the module is not complete.

Before You Start

Answer these closed-book before starting the main path:

Your service's average latency is 120ms. Is that a good number? Why is the answer almost always "I don't have enough information"?
Your single-instance service handles 1,000 req/s. You add nine more instances behind a load balancer. What is the upper bound on new throughput, and why is it almost never 10,000?
A bug causes one microservice to time out. The other six services that depend on it start returning 500s. What is this pattern called and what protects against it?
Your team's SLO is "99.9% availability per quarter." Over 90 days, how much downtime does the error budget permit?
It's 3am and the pager just went off. What is the very first thing you do - and why is it not "open the code"?

Diagnostic Interpretation

Score	Interpretation
4-5 solid answers	You are ready for the full path.
2-3 solid answers	Continue, but expect extra time in Cluster 3 (Reliability Engineering) and Cluster 4 (Capacity Planning and Load).
0-1 solid answers	Revisit Semester 6 Module 5 (Distributed Systems Fundamentals) briefly. Reliability work here only makes sense on top of a concrete mental model of partial failure and retries.

What This Module Is For

Scale and reliability are the two places single-machine engineering intuitions quietly go to die. Later work in your career will repeatedly ask:

my dashboards are green but users are complaining - what is wrong with my dashboards?
my service is at 40% CPU but request latency just doubled - what else is saturated?
my backend started timing out and now my frontend is also down - why did the blast radius grow?
my retry storm is making the outage worse - how do I shed load instead of amplifying it?
my SLO is 99.9% and we burned half the error budget in three days - do we ship the feature or freeze?

This module builds the reasoning needed for:

production on-call work where you must reason about tail latency, not averages
architecture reviews where "what happens at 10x load" is an actual answer, not a shrug
SLO-driven development, error budgets, and the engineering-product negotiation they enable
postmortem authoring and the cultural discipline that makes learning systematic
the Semester 8 capstone and Module 5 leadership work

You are learning to reason about load and failure without handwaving.

Concept Map

How To Use This Module

Work in order. Later clusters only make sense if the earlier performance vocabulary is stable.

Cluster 1: Performance Reasoning

Order	Concept	Type	Focus
1	Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals	PRIMARY	The three distinct quantities every engineer must separate, and the three framings that organize every dashboard you will ever build
2	Percentile Latency and Why Averages Lie	PRIMARY	p50, p95, p99, p999, tail amplification, and the worked example where two systems with identical averages produce very different user experiences
3	Amdahl's Law and the Universal Scalability Law: The Limits of Parallelism	PRIMARY	Why linear scale-out is impossible, where the ceiling is, and why more machines can sometimes make things slower

Cluster mastery check: Given a latency histogram, can you distinguish p50, p95, p99 in words; estimate which golden signal would alert first under overload; and name the USL coefficient that makes scaling go backwards?
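The "identical averages, different experience" idea from Concept 2 can be sketched with synthetic data. This is a minimal illustration, not the module's own worked example: the two sample sets are invented, and `percentile` is a hypothetical helper that uses the nearest-rank definition (one of several percentile definitions in use).

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds, 100 requests per system (invented data).
# System A: every request takes the same time.
system_a = [100] * 100
# System B: most requests are fast, but 5 in 100 hit a slow path.
system_b = [50] * 95 + [1050] * 5

for name, samples in (("A", system_a), ("B", system_b)):
    print(
        name,
        "mean:", statistics.mean(samples),
        "p50:", percentile(samples, 50),
        "p95:", percentile(samples, 95),
        "p99:", percentile(samples, 99),
    )
```

Both systems report a mean of 100 ms. System A is 100 ms at every percentile, while System B shows 50 ms at p50 and p95 but 1050 ms at p99. A dashboard that plots only the mean cannot tell them apart, and in this data even p95 misses the slow path.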

Cluster 2: Scaling Strategies

Order	Concept	Type	Focus
4	Vertical vs Horizontal Scaling: The When and Why	PRIMARY	Scale up vs scale out, cost curves, operational consequences, and the failure modes that determine the choice
5	Statelessness, Sticky Sessions, and Session Stores	PRIMARY	Why stateless is the precondition for cheap horizontal scaling, and the three places session state can live
6	Caching Strategies at Every Layer	PRIMARY	Client, CDN, reverse proxy, application, and database caches; cache-aside, write-through, write-behind, refresh-ahead; invalidation as the hard part

Cluster mastery check: Given a synchronous request-response service at 1k req/s needing to grow to 50k, can you name at least five caching-or-stateless changes you would make, in order of impact, and the new failure modes each introduces?
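As a reference point for Concept 6, cache-aside can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions: `CacheAside` is a hypothetical class, a plain dict stands in for the database, and there is no stampede protection or size limit.

```python
import time


class CacheAside:
    """Cache-aside (lazy loading): the application checks the cache first and,
    on a miss, reads the source of truth and populates the cache itself."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical: key -> value from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: fall through to the source of truth.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Call after writing to the source of truth. Skipping this is the
        classic staleness trap of cache-aside."""
        self._entries.pop(key, None)


# A dict stands in for the database (assumption for illustration only).
database = {"user:1": "Ada"}
cache = CacheAside(loader=database.__getitem__, ttl_seconds=30)

first = cache.get("user:1")    # miss: loaded from the database
second = cache.get("user:1")   # hit: served from the cache
database["user:1"] = "Grace"   # write that bypasses the cache
stale = cache.get("user:1")    # hit: still the old value
cache.invalidate("user:1")
fresh = cache.get("user:1")    # miss: reloaded after invalidation
print(first, second, stale, fresh, cache.hits, cache.misses)
```

The `stale` read is the point of the sketch: the cache keeps serving the old value until the TTL expires or someone invalidates the key, which is the new failure mode this change introduces.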

Cluster 3: Reliability Engineering

Order	Concept	Type	Focus
7	SLIs, SLOs, and Error Budgets	PRIMARY	Indicators, objectives, the error-budget math, and the negotiation the budget enables between reliability and velocity
8	Failure Modes: Cascading, Correlated, and Gray	PRIMARY	The three patterns that defeat naive redundancy, plus the concrete mitigations: timeouts, circuit breakers, bulkheads, and hedged requests
9	Chaos Engineering and Game Days	PRIMARY	Experimentation as reliability evidence, the four principles of chaos, the minimum-viable game day, and why staging is not an answer

Cluster mastery check: Given a 99.9% SLO whose error budget is being burned within 30 days, can you write the error-budget policy, identify one cascading and one gray failure your service is exposed to, and design a one-hour game day that would test one of them?
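The error-budget arithmetic behind Concept 7 is short enough to sketch. The function names are hypothetical, the 30-day window and the 0.5% observed error fraction are assumed inputs, and the SLO is treated as a simple ratio. Real burn-rate alerting uses multiple windows, which this does not model.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of full unavailability a time-based SLO permits in the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_bad_fraction, slo):
    """How many times faster than 'exactly on budget' the budget is being spent.
    A burn rate of 1.0 uses the whole budget in exactly one window."""
    return observed_bad_fraction / (1 - slo)


def days_until_exhausted(rate, window_days):
    """Days a full budget lasts if the current burn rate holds."""
    return window_days / rate


slo = 0.999          # 99.9% objective
window_days = 30     # assumed window for this sketch

budget = error_budget_minutes(slo, window_days)
rate = burn_rate(observed_bad_fraction=0.005, slo=slo)  # assumed: 0.5% of requests failing
remaining = days_until_exhausted(rate, window_days)

print(round(budget, 1), "minutes of budget")
print(round(rate, 1), "x burn rate")
print(round(remaining, 1), "days until the budget is gone")
```

With these inputs the budget is about 43.2 minutes per 30 days, a 0.5% error fraction burns it about 5 times too fast, and a full budget would be gone in roughly 6 days. Numbers like these give an error-budget policy its freeze rule.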

Cluster 4: Capacity Planning and Load

Order	Concept	Type	Focus
10	Back-Pressure, Queuing Theory Basics, and Little's Law	PRIMARY	`L = lambda * W`, why utilization past 80% is a latency cliff, and what back-pressure actually sends upstream
11	Load Shedding, Rate Limiting, and Admission Control	PRIMARY	Token bucket, leaky bucket, concurrency limits, adaptive shedding, and the cost of work accepted but not served
12	Capacity Planning: Headroom, Peaks, and Growth Modeling	PRIMARY	Peak-to-mean ratios, headroom targets, growth extrapolation, and how to produce a defensible capacity number you do not need to apologize for

Cluster mastery check: For a service with mean 200 req/s and peak 1,500 req/s, can you estimate Little's Law quantities, pick a rate limit, set a headroom target, and produce a six-month capacity forecast from first principles?
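A rough sketch of the Little's Law arithmetic for the rates in this mastery check. The 50 ms time in system and the 20 ms service time are assumed values, not given by the module. The latency-versus-utilization curve uses the textbook M/M/1 formula, a simplifying model (single server, random arrivals, exponential service times) that real services only approximate.

```python
def in_flight(arrival_rate_per_s, time_in_system_s):
    """Little's Law: L = lambda * W (average number of requests in the system)."""
    return arrival_rate_per_s * time_in_system_s


def mm1_time_in_system(service_time_s, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_time_s / (1 - utilization)


ASSUMED_TIME_IN_SYSTEM_S = 0.050   # hypothetical: 50 ms per request end to end
ASSUMED_SERVICE_TIME_S = 0.020     # hypothetical: 20 ms of pure service time

print("in flight at mean:", round(in_flight(200, ASSUMED_TIME_IN_SYSTEM_S), 2))
print("in flight at peak:", round(in_flight(1500, ASSUMED_TIME_IN_SYSTEM_S), 2))

for rho in (0.5, 0.8, 0.9, 0.95):
    w = mm1_time_in_system(ASSUMED_SERVICE_TIME_S, rho)
    print(f"utilization {rho:.2f}: mean time in system {w * 1000:.0f} ms")
```

Under these assumptions the service holds about 10 requests in flight at the mean and about 75 at peak. In the M/M/1 model, mean time in system is 2x the service time at 50% utilization, 5x at 80%, 10x at 90%, and 20x at 95%. That steep climb past 80% is the latency cliff.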

Cluster 5: Incident and Observability

Order	Concept	Type	Focus
13	Observability Three Pillars: Metrics, Logs, Traces	PRIMARY	What each pillar is cheap at, what each is expensive at, the known-vs-unknown-unknowns framing, and why one pillar alone is not observability
14	Incident Lifecycle: Detect, Triage, Mitigate, Resolve, Review	PRIMARY	The five phases, the two clocks that matter (MTTD and MTTR), the Incident Commander role, and why "fix it before you understand it" is the right answer
15	Blameless Postmortems and Learning from Incidents	SUPPORTING	The postmortem template, the blameless contract, the difference between cause and contributing factor, and why the action-item list is where most postmortems fail

Cluster mastery check: Given a one-hour partial outage traced to a slow dependency and a misconfigured timeout, can you walk through the full lifecycle from page to postmortem, and distinguish the actions that should block the review from those that should not?

Then work these practice pages:

Order	Practice path	Focus
1	Performance Profiling Lab	USE method walkthrough, percentile histograms by hand, Amdahl speedup calculation
2	Scaling Design Workshop	Take a monolithic service from 1k to 50k req/s by design; defend each choice
3	Reliability and SLO Clinic	Write SLIs and SLOs for two real APIs; compute error budget burn; design an error-budget policy
4	Scale and Reliability Katas	Write an SLI/SLO for a real API; design a rate-limiter; apply Little's Law to a real queue; run a tabletop incident

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Distinguish latency, throughput, and utilization, and name a concrete metric and alert for each under USE, RED, and the Four Golden Signals.
Read and explain a latency histogram in percentiles, including p50, p95, p99, and p999, and argue why an average can be green while p99 is already violating the SLO.
State Amdahl's Law and the Universal Scalability Law in one line each, explain the coherence penalty, and compute where the scaling curve bends for a given serial fraction.
Choose between vertical and horizontal scaling for a given workload with an explicit reason grounded in cost, failure mode, and operational load.
Design a stateless tier and a session store that does not reintroduce a scaling bottleneck, and name the three places state can still creep in.
Pick a caching pattern (cache-aside, write-through, write-behind, refresh-ahead) for a given read/write ratio and staleness tolerance, and name the invalidation trap each pattern hides.
Write an SLI as a ratio of good-events to valid-events, an SLO as a numeric objective over a window, and compute the error budget and its burn rate.
Name the three failure-mode families (cascading, correlated, gray), give a concrete example of each, and list the mitigation for each.
Design a one-hour game day for a specific service: hypothesis, blast-radius control, abort conditions, and the evidence it would produce.
Apply Little's Law to a real queue (L = lambda * W), predict what happens as utilization crosses 0.8, and explain what back-pressure does that retries do not.
Distinguish rate limiting, load shedding, and admission control; pick one for a given overload scenario and name the failure mode each prevents.
Produce a capacity number from data: mean, peak, peak-to-mean ratio, headroom target, and a growth extrapolation that you would defend to a skeptical reviewer.
Explain what each observability pillar is cheap vs expensive at, and justify the decision of when to instrument with metrics, logs, or traces.
Walk through the incident lifecycle for a real outage, including the Incident Commander role, the two clocks (MTTD and MTTR), and the mitigation-before-understanding principle.
Author a blameless postmortem that distinguishes cause from contributing factor, produces action items that are specific and dated, and does not name and shame.

Outputs

one USE / RED / Golden Signals dashboard sketch for a real or hypothetical service, with one metric per cell and one alert per row
one p50/p95/p99/p999 histogram worked out by hand from synthetic data, including the two-systems-with-equal-mean example
one Amdahl / USL worksheet computing the scaling ceiling for a specific serial fraction and a specific coherence penalty
one two-page design memo scaling a monolithic service to 10x load using vertical, horizontal, stateless, and caching changes
one SLI + SLO + error-budget policy document for a real API that you use, including burn-rate alerts and a freeze rule
one chaos / game-day plan with hypothesis, blast radius, abort condition, and evidence list
one Little's Law + rate-limit worksheet: given mean, peak, and service time, compute queue depth, pick a rate limit, and predict the failure mode of each choice
one capacity plan: six-month extrapolation with stated assumptions and at least two scenarios (nominal and worst-case)
one full incident writeup analyzing a public postmortem using the module vocabulary (cascading, gray, SLO burn, MTTD, MTTR, contributing factors)
one mistake log with at least 10 entries tagged averaged-instead-of-percentiled, no-fencing-on-cache-stampede, shared-failure-domain, retry-without-backoff, alert-on-cause-not-symptom, etc.

Completion Standard

You have completed Module 4 when all of these are true:

you no longer talk about "average latency" in production contexts without being asked
you can state Little's Law and apply it to a real queue without looking up the formula
you can write an SLI and SLO that a product manager and an SRE would both accept
you can distinguish cascading, correlated, and gray failures in a fresh incident narrative
you can describe the incident lifecycle and name what an Incident Commander does that a responder does not
you have authored at least one blameless postmortem and can describe why naming a person in a cause statement is a bug in the writeup
you can design a chaos experiment with an explicit abort condition, and explain why "we tested it in staging" is insufficient
you can defend a capacity number with a peak-to-mean ratio, a headroom target, and a growth model

If you are still saying "just add more machines" without naming the USL coefficient you are about to activate, the module is not complete.
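To make "the USL coefficient you are about to activate" concrete, here is a small sketch of Amdahl's Law and Gunther's Universal Scalability Law in their standard forms. The coefficient values (5% serial fraction or contention, 0.1% coherency) are invented for illustration and are not measurements of any real system.

```python
import math


def amdahl_speedup(n, serial_fraction):
    """Amdahl's Law: speedup on n workers when serial_fraction cannot be parallelized."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n)


def usl_capacity(n, contention, coherency):
    """Universal Scalability Law: relative capacity of n nodes.
    contention (alpha) models queueing on shared resources;
    coherency (beta) models the cost of keeping nodes consistent with each other."""
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))


def usl_peak_nodes(contention, coherency):
    """Node count at which USL capacity peaks: sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - contention) / coherency)


ALPHA = 0.05    # hypothetical contention coefficient (also used as Amdahl's serial fraction)
BETA = 0.001    # hypothetical coherency coefficient

print("Amdahl ceiling:", round(1 / ALPHA, 1))
print("Amdahl speedup at 10 nodes:", round(amdahl_speedup(10, ALPHA), 2))
print("USL peak near", round(usl_peak_nodes(ALPHA, BETA), 1), "nodes")
for n in (10, 30, 100):
    print(f"USL capacity at {n} nodes:", round(usl_capacity(n, ALPHA, BETA), 2))
```

With these coefficients, Amdahl's Law caps speedup at 20x no matter how many machines are added. The USL curve peaks near 31 nodes and then bends down: 100 nodes deliver less capacity (about 6.3x) than 30 nodes (about 9.0x). The coherency term is what makes scaling go backwards, because it grows with the square of the node count.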

Reading Policy

Concept pages are the main path.
Local book chunks (system-design-primer, Fundamentals of Software Architecture) are selective reinforcement, not a second syllabus.
"Read only if stuck" means try the concept page, the self-check, and the drill first.
External validated links (sre.google, brendangregg.com, how.complexsystems.fail, principlesofchaos.org) are targeted; read them when the concept page points to them.
Because this module is the operating discipline for every distributed design you ship after it, hand-drawn histograms, hand-computed error budgets, and written postmortems are required, not optional.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3; sketch a USE/RED/Golden Signals dashboard for a service you know
2	Concepts 4-6; scale-out design memo for a monolithic example
3	Concepts 7-9; write one SLI/SLO for a real API
4	Concepts 10-12; Little's Law worksheet and capacity plan for a workload you have data for
5	Concepts 13-15 and Practice 1 (performance profiling lab)
6	Practice 2 (scaling design workshop)
7	Practice 3 (reliability and SLO clinic)
8	Practice 4 (katas), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Rich Learning Pages
