Module 4: Scale, Reliability & Performance

Primary texts: The Site Reliability Engineering Book (Beyer et al., Google) and The Site Reliability Workbook as the operating model; System Design Primer for the scalability vocabulary; Fundamentals of Software Architecture (Richards, Ford) for architecture characteristics and fitness functions.

Selective support: Brendan Gregg's Systems Performance for the USE method; Richard Cook's How Complex Systems Fail; Principles of Chaos (principlesofchaos.org); Neil Gunther's Universal Scalability Law papers; the local system-design-primer chunks for latency numbers, caching levels, and availability patterns.

This module is where "my service works on one box" becomes "my service survives a Black Friday spike, one bad deploy, and a regional outage without waking anyone up at 3am who shouldn't be awake." You already know distributed-systems fundamentals (Semester 6, Module 5) and replication and partitioning (Semester 6, Module 3). Here you learn the operating discipline that makes those designs stay up under real load and real failure.


Scope of This Module

This module is not "a catalog of scalability tricks." It is where reliability stops being a vibe and becomes a number, where performance stops being a benchmark and becomes a distribution, and where incident response stops being heroics and becomes a lifecycle.

What it covers in depth:

  • latency, throughput, and utilization as three different quantities, and the USE, RED, and Four Golden Signals families that monitor them
  • why averages hide outages and why the entire production dialect of "performance" is written in percentiles
  • Amdahl's Law and the Universal Scalability Law, and why throwing machines at a problem has a ceiling
  • vertical versus horizontal scaling and why "stateless" is the word that separates the two worlds
  • statelessness, sticky sessions, and external session stores
  • caching at every layer of the stack, cache update patterns, and why cache invalidation earns its place in the "only two hard problems in CS" joke
  • Service Level Indicators, Service Level Objectives, and error budgets as the contract between reliability and velocity
  • cascading, correlated, and gray failures - the three failure patterns that defeat naive redundancy
  • chaos engineering and game days as the empirical discipline that replaces "we tested it in staging"
  • Little's Law, queuing intuition, and back-pressure as the math behind overload
  • load shedding, rate limiting, and admission control as the levers you actually pull under overload
  • capacity planning with headroom, peaks, and growth curves
  • the three pillars of observability: metrics, logs, traces
  • the incident lifecycle: detect, triage, mitigate, resolve, review
  • blameless postmortems and the learning systems that turn one outage into structural improvements

What it deliberately does not try to finish here:

  • a full performance-tuning curriculum (flame graphs, kernel tracing, perf, eBPF)
  • database-specific tuning beyond what Semester 6 covered
  • a complete incident-command playbook (that belongs in the Module 5 leadership context)
  • SLO math in its deepest form (multi-window, multi-burn-rate alerts get referenced but not exhaustively modeled)

If you leave this module still saying "we'll scale it when we need to" or "the average response time looks fine," the module is not complete.


Before You Start

Answer these closed-book before starting the main path:

  1. Your service's average latency is 120ms. Is that a good number? Why is the answer almost always "I don't have enough information"?
  2. Your single-instance service handles 1,000 req/s. You add nine more instances behind a load balancer. What is the upper bound on new throughput, and why is it almost never 10,000?
  3. A bug causes one microservice to time out. The other six services that depend on it start returning 500s. What is this pattern called and what protects against it?
  4. Your team's SLO is "99.9% availability per quarter." Over 90 days, how much downtime does the error budget permit?
  5. It's 3am and the pager just went off. What is the very first thing you do - and why is it not "open the code"?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but expect extra time in Cluster 3 (Reliability Engineering) and Cluster 4 (Capacity Planning and Load).

0-1 solid answers

  • Revisit Semester 6 Module 5 (Distributed Systems Fundamentals) briefly. Reliability work here only makes sense on top of a concrete mental model of partial failure and retries.

What This Module Is For

Scale and reliability are the two places single-machine engineering intuitions quietly go to die. Later work in your career will repeatedly ask:

  • my dashboards are green but users are complaining - what is wrong with my dashboards?
  • my service is at 40% CPU but request latency just doubled - what else is saturated?
  • my backend started timing out and now my frontend is also down - why did the blast radius grow?
  • my retry storm is making the outage worse - how do I shed load instead of amplifying it?
  • my SLO is 99.9% and we burned half the error budget in three days - do we ship the feature or freeze?

This module builds the reasoning needed for:

  • production on-call work where you must reason about tail latency, not averages
  • architecture reviews where "what happens at 10x load" is an actual answer, not a shrug
  • SLO-driven development, error budgets, and the engineering-product negotiation they enable
  • postmortem authoring and the cultural discipline that makes learning systematic
  • the Semester 8 capstone and Module 5 leadership work

You are learning to reason about load and failure without handwaving.


Concept Map


How To Use This Module

Work in order. Later clusters only make sense if the earlier performance vocabulary is stable.

Cluster 1: Performance Reasoning

Order | Concept | Type | Focus
1 | Latency, Throughput, Utilization, and the USE / RED / Four Golden Signals | PRIMARY | The three distinct quantities every engineer must separate, and the three framings that organize every dashboard you will ever build
2 | Percentile Latency and Why Averages Lie | PRIMARY | p50, p95, p99, p999, tail amplification, and the worked example where two systems with identical averages have very different user experience
3 | Amdahl's Law and the Universal Scalability Law: The Limits of Parallelism | PRIMARY | Why linear scale-out is impossible, where the ceiling is, and why more machines can sometimes make things slower

Cluster mastery check: Given a latency histogram, can you distinguish p50, p95, p99 in words; estimate which golden signal would alert first under overload; and name the USL coefficient that makes scaling go backwards?
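
To make the percentile vocabulary concrete, here is a minimal sketch of the "identical averages, different experience" situation. The latency values are synthetic and chosen only for illustration, and the nearest-rank method is one of several common percentile definitions; monitoring systems often interpolate or use histogram buckets instead.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic latencies in milliseconds (assumed data, 100 requests per system).
steady = [100] * 100              # every request takes 100 ms
spiky = [50] * 98 + [2550] * 2    # 98 fast requests, 2 very slow ones

for name, data in (("steady", steady), ("spiky", spiky)):
    mean = sum(data) / len(data)
    print(name, "mean:", mean, "p50:", percentile(data, 50),
          "p95:", percentile(data, 95), "p99:", percentile(data, 99))
```

Both systems report a mean of 100 ms, yet the spiky system's p99 is 2550 ms. A dashboard that plots only the mean cannot tell them apart.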

Cluster 2: Scaling Strategies

Order | Concept | Type | Focus
4 | Vertical vs Horizontal Scaling: The When and Why | PRIMARY | Scale up vs scale out, cost curves, operational consequences, and the failure modes that determine the choice
5 | Statelessness, Sticky Sessions, and Session Stores | PRIMARY | Why stateless is the precondition for cheap horizontal scaling, and the three places session state can live
6 | Caching Strategies at Every Layer | PRIMARY | Client, CDN, reverse proxy, application, and database caches; cache-aside, write-through, write-behind, refresh-ahead; invalidation as the hard part

Cluster mastery check: Given a synchronous request-response service at 1k req/s needing to grow to 50k, can you name at least five caching-or-stateless changes you would make, in order of impact, and the new failure modes each introduces?
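
As a reference point for the caching patterns named in this cluster, here is a minimal cache-aside sketch. It assumes an in-process dictionary as the cache and a hypothetical `load_from_db` callable as the source of truth; the class and parameter names are illustrative and not taken from any library. It has no stampede protection, which is one of the new failure modes the mastery check asks you to name.

```python
import time

class CacheAside:
    """Cache-aside: the application checks the cache first and fills it on a miss."""

    def __init__(self, load_from_db, ttl_s=30.0, clock=time.monotonic):
        self._load = load_from_db      # hypothetical loader for the source of truth
        self._ttl = ttl_s
        self._clock = clock
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]            # hit: the database is not touched
        value = self._load(key)        # miss or expired: read the source of truth
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key):
        """Writers must call this after updating the database, or readers see stale data."""
        self._entries.pop(key, None)

# Illustrative usage with a fake database and a load counter.
db = {"user:1": "Ada"}
loads = []

def load(key):
    loads.append(key)
    return db[key]

cache = CacheAside(load, ttl_s=30.0)
first = cache.get("user:1")     # miss, loads from db
second = cache.get("user:1")    # hit, served from cache
db["user:1"] = "Grace"          # write that forgets to invalidate
stale = cache.get("user:1")     # still the old value until TTL or invalidation
cache.invalidate("user:1")
fresh = cache.get("user:1")     # miss again, sees the new value
print(first, second, stale, fresh, "db loads:", len(loads))
```

The stale read in the middle is the invalidation trap in miniature: the cache is only as correct as the discipline of every writer.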

Cluster 3: Reliability Engineering

Order | Concept | Type | Focus
7 | SLIs, SLOs, and Error Budgets | PRIMARY | Indicators, objectives, the error-budget math, and the negotiation the budget enables between reliability and velocity
8 | Failure Modes: Cascading, Correlated, and Gray | PRIMARY | The three patterns that defeat naive redundancy, plus the concrete mitigations: timeouts, circuit breakers, bulkheads, and hedged requests
9 | Chaos Engineering and Game Days | PRIMARY | Experimentation as reliability evidence, the four principles of chaos, the minimum-viable game day, and why staging is not an answer

Cluster mastery check: Given a 99.9% SLO whose error budget is on pace to be exhausted in 30 days, can you write the error-budget policy, identify one cascading and one gray failure your service is exposed to, and design a one-hour game day that would test one of them?
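
The error-budget arithmetic behind this check can be sketched in a few lines. The 30-day window and the 0.5% observed failure fraction below are assumptions chosen for illustration, and the function names are hypothetical. The sketch treats the SLO as time-based availability; a request-based SLO uses the same ratios with events instead of minutes.

```python
def error_budget_minutes(slo, window_days):
    """Minutes of unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_bad_fraction, slo):
    """Budget consumption speed; 1.0 spends exactly the whole budget over the window."""
    return observed_bad_fraction / (1 - slo)

def days_until_exhausted(window_days, rate):
    """Days a full budget lasts at a constant burn rate."""
    return float("inf") if rate <= 0 else window_days / rate

SLO = 0.999
WINDOW_DAYS = 30       # assumed SLO window
OBSERVED_BAD = 0.005   # assumed: 0.5% of requests are currently failing

budget = error_budget_minutes(SLO, WINDOW_DAYS)
rate = burn_rate(OBSERVED_BAD, SLO)
print(f"budget: {budget:.1f} minutes per {WINDOW_DAYS} days")
print(f"burn rate: {rate:.1f}x")
print(f"full budget lasts about {days_until_exhausted(WINDOW_DAYS, rate):.1f} days at this rate")
```

A 99.9% objective over 30 days leaves roughly 43 minutes of budget, which is why a single bad deploy can consume most of a month's allowance.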

Cluster 4: Capacity Planning and Load

Order | Concept | Type | Focus
10 | Back-Pressure, Queuing Theory Basics, and Little's Law | PRIMARY | L = lambda * W, why utilization past 80% is a latency cliff, and what back-pressure actually sends upstream
11 | Load Shedding, Rate Limiting, and Admission Control | PRIMARY | Token bucket, leaky bucket, concurrency limits, adaptive shedding, and the cost of work accepted but not served
12 | Capacity Planning: Headroom, Peaks, and Growth Modeling | PRIMARY | Peak-to-mean ratios, headroom targets, growth extrapolation, and how to produce a defensible capacity number you do not need to apologize for

Cluster mastery check: For a service with mean 200 req/s and peak 1,500 req/s, can you estimate Little's Law quantities, pick a rate-limit, set a headroom target, and produce a six-month capacity forecast from first principles?
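
One way to start on this check is sketched below. The mean and peak rates come from the check itself; the 50 ms time in system, the 30% headroom target, and the 5% monthly growth are assumptions for illustration. The utilization curve uses the M/M/1 model (W = S / (1 - rho)), which is a simplification of real services but shows why latency climbs steeply at high utilization. Holding W fixed at peak is also an assumption: in practice W rises as the service saturates.

```python
import math

def little_l(arrival_rate, time_in_system):
    """Little's Law: mean number of requests in the system, L = lambda * W."""
    return arrival_rate * time_in_system

def mm1_time_in_system(service_time, utilization):
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)

MEAN_RPS, PEAK_RPS = 200, 1500   # from the mastery check
W_SECONDS = 0.05                 # assumed mean time in system
HEADROOM = 0.30                  # assumed: keep 30% of capacity unused at peak
MONTHLY_GROWTH = 0.05            # assumed compound growth per month

print("in flight at mean:", little_l(MEAN_RPS, W_SECONDS))
print("in flight at peak:", little_l(PEAK_RPS, W_SECONDS))
print("peak-to-mean ratio:", PEAK_RPS / MEAN_RPS)

for rho in (0.5, 0.8, 0.9, 0.95):
    slowdown = mm1_time_in_system(1.0, rho)
    print(f"utilization {rho:.2f}: time in system is {slowdown:.1f}x the service time")

peak_in_6_months = PEAK_RPS * (1 + MONTHLY_GROWTH) ** 6
required_capacity = peak_in_6_months / (1 - HEADROOM)
print("forecast peak:", round(peak_in_6_months), "req/s")
print("capacity to provision:", math.ceil(required_capacity), "req/s")
```

The loop shows the shape of the cliff: under this model, moving from 50% to 80% utilization multiplies time in system by 2.5, and moving from 80% to 95% multiplies it by another 4.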

Cluster 5: Incident and Observability

Order | Concept | Type | Focus
13 | Observability Three Pillars: Metrics, Logs, Traces | PRIMARY | What each pillar is cheap at, what each is expensive at, the known-vs-unknown-unknowns framing, and why one pillar alone is not observability
14 | Incident Lifecycle: Detect, Triage, Mitigate, Resolve, Review | PRIMARY | The five phases, the two clocks that matter (MTTD and MTTR), the Incident Commander role, and why "fix it before you understand it" is the right answer
15 | Blameless Postmortems and Learning from Incidents | SUPPORTING | The postmortem template, the blameless contract, the difference between cause and contributing factor, and why the action-item list is where most postmortems fail

Cluster mastery check: Given a one-hour partial outage traced to a slow dependency and a misconfigured timeout, can you walk through the full lifecycle from page to postmortem, and distinguish the actions that should block the review from those that should not?

Then work these practice pages:

Order | Practice path | Focus
1 | Performance Profiling Lab | USE method walkthrough, percentile histograms by hand, Amdahl speedup calculation
2 | Scaling Design Workshop | Take a monolithic service from 1k to 50k req/s by design; defend each choice
3 | Reliability and SLO Clinic | Write SLIs and SLOs for two real APIs; compute error budget burn; design an error-budget policy
4 | Scale and Reliability Katas | Write an SLI/SLO for a real API; design a rate-limiter; apply Little's Law to a real queue; run a tabletop incident

Take the Module Quiz after completing the concept and practice paths. Use the Reference and Selective Reading page and the Learning Resources page only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Distinguish latency, throughput, and utilization, and name a concrete metric and alert for each under USE, RED, and the Four Golden Signals.
  2. Read and explain a latency histogram in percentiles, including p50, p95, p99, and p999, and argue why an average can be green while p99 is already violating the SLO.
  3. State Amdahl's Law and the Universal Scalability Law in one line each, explain the coherence penalty, and compute where the scaling curve bends for a given serial fraction.
  4. Choose between vertical and horizontal scaling for a given workload with an explicit reason grounded in cost, failure mode, and operational load.
  5. Design a stateless tier and a session store that does not reintroduce a scaling bottleneck, and name the three places state can still creep in.
  6. Pick a caching pattern (cache-aside, write-through, write-behind, refresh-ahead) for a given read/write ratio and staleness tolerance, and name the invalidation trap each pattern hides.
  7. Write an SLI as a ratio of good-events to valid-events, an SLO as a numeric objective over a window, and compute the error budget and its burn rate.
  8. Name the three failure-mode families (cascading, correlated, gray), give a concrete example of each, and list the mitigation for each.
  9. Design a one-hour game day for a specific service: hypothesis, blast-radius control, abort conditions, and the evidence it would produce.
  10. Apply Little's Law to a real queue (L = lambda * W), predict what happens as utilization crosses 0.8, and explain what back-pressure does that retries do not.
  11. Distinguish rate limiting, load shedding, and admission control; pick one for a given overload scenario and name the failure mode each prevents.
  12. Produce a capacity number from data: mean, peak, peak-to-mean ratio, headroom target, and a growth extrapolation that you would defend to a skeptical reviewer.
  13. Explain what each observability pillar is cheap vs expensive at, and justify the decision of when to instrument with metrics, logs, or traces.
  14. Walk through the incident lifecycle for a real outage, including the Incident Commander role, the two clocks (MTTD and MTTR), and the mitigation-before-understanding principle.
  15. Author a blameless postmortem that distinguishes cause from contributing factor, produces action items that are specific and dated, and does not name and shame.
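
For the rate-limiting part of objective 11, a token bucket is one of the standard mechanisms. The sketch below is a minimal single-threaded version under stated assumptions: the class name and parameters are illustrative, time is passed in explicitly so the behavior is deterministic, and a production limiter would also need locking and a monotonic clock such as `time.monotonic()`.

```python
class TokenBucket:
    """Token bucket: tokens refill at a steady rate, and each request spends one."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s        # sustained requests per second
        self.capacity = burst         # largest burst admitted at once
        self.tokens = float(burst)    # start full
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True               # admit the request
        return False                  # reject; the caller should back off, not retry at once

# Illustrative usage: 2 req/s sustained, bursts of up to 3.
bucket = TokenBucket(rate_per_s=2, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]
later_result = bucket.allow(now=0.5)   # half a second refills one token
print(burst_results, later_result)
```

The rejection path is where rate limiting differs from queuing: a rejected request costs almost nothing, while a queued request that times out has consumed resources without serving anyone.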

Outputs

  • one USE / RED / Golden Signals dashboard sketch for a real or hypothetical service, with one metric per cell and one alert per row
  • one p50/p95/p99/p999 histogram worked out by hand from synthetic data, including the two-systems-with-equal-mean example
  • one Amdahl / USL worksheet computing the scaling ceiling for a specific serial fraction and a specific coherence penalty
  • one two-page design memo scaling a monolithic service to 10x load using vertical, horizontal, stateless, and caching changes
  • one SLI + SLO + error-budget policy document for a real API that you use, including burn-rate alerts and a freeze rule
  • one chaos / game-day plan with hypothesis, blast radius, abort condition, and evidence list
  • one Little's Law + rate-limit worksheet: given mean, peak, and service time, compute queue depth, pick a rate limit, and predict the failure mode of each choice
  • one capacity plan: six-month extrapolation with stated assumptions and at least two scenarios (nominal and worst-case)
  • one full incident writeup analyzing a public postmortem using the module vocabulary (cascading, gray, SLO burn, MTTD, MTTR, contributing factors)
  • one mistake log with at least 10 entries tagged averaged-instead-of-percentiled, no-fencing-on-cache-stampede, shared-failure-domain, retry-without-backoff, alert-on-cause-not-symptom, etc.

Completion Standard

You have completed Module 4 when all of these are true:

  • you no longer cite "average latency" in production contexts unless someone specifically asks for it
  • you can state Little's Law and apply it to a real queue without looking up the formula
  • you can write an SLI and SLO that a product manager and an SRE would both accept
  • you can distinguish cascading, correlated, and gray failures in a fresh incident narrative
  • you can describe the incident lifecycle and name what an Incident Commander does that a responder does not
  • you have authored at least one blameless postmortem and can describe why naming a person in a cause statement is a bug in the writeup
  • you can design a chaos experiment with an explicit abort condition, and explain why "we tested it in staging" is insufficient
  • you can defend a capacity number with a peak-to-mean ratio, a headroom target, and a growth model

If you are still saying "just add more machines" without naming the USL coefficient you are about to activate, the module is not complete.
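
A minimal sketch of the two scaling laws makes the ceiling visible. The contention and coherency coefficients below are assumed values picked for illustration, not measurements; in practice they are fitted from load-test data. With the coherency term set to zero, the USL reduces to Amdahl's Law with the contention coefficient as the serial fraction.

```python
import math

def amdahl_speedup(n, serial_fraction):
    """Amdahl's Law: speedup on n workers when a fixed fraction of the work is serial."""
    return 1 / (serial_fraction + (1 - serial_fraction) / n)

def usl_capacity(n, contention, coherency):
    """Universal Scalability Law: relative capacity at n nodes."""
    return n / (1 + contention * (n - 1) + coherency * n * (n - 1))

def usl_peak_nodes(contention, coherency):
    """Node count where USL capacity peaks; adding nodes beyond it reduces throughput."""
    return math.sqrt((1 - contention) / coherency)

CONTENTION = 0.05   # assumed serial / queuing fraction
COHERENCY = 0.001   # assumed cost of keeping nodes consistent with each other

for n in (1, 10, 31, 100):
    print(f"n={n:>3}  amdahl={amdahl_speedup(n, CONTENTION):6.2f}  "
          f"usl={usl_capacity(n, CONTENTION, COHERENCY):6.2f}")
print("USL peak near", round(usl_peak_nodes(CONTENTION, COHERENCY), 1), "nodes")
```

Under these assumed coefficients, Amdahl's Law caps speedup below 20x no matter how many workers are added, and the USL curve peaks around 31 nodes and then declines, so 100 nodes deliver less capacity than 31.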


Reading Policy

  • Concept pages are the main path.
  • Local book chunks (system-design-primer, Fundamentals of Software Architecture) are selective reinforcement, not a second syllabus.
  • "Read only if stuck" means: try the concept page, the self-check, and the drill first.
  • External validated links (sre.google, brendangregg.com, how.complexsystems.fail, principlesofchaos.org) are targeted; read them when the concept page points to them.
  • Because this module is the operating discipline for every distributed design you ship after it, hand-drawn histograms, hand-computed error budgets, and written postmortems are required, not optional.

Suggested Weekly Flow

Day | Work
1 | Concepts 1-3; sketch a USE/RED/Golden Signals dashboard for a service you know
2 | Concepts 4-6; scale-out design memo for a monolithic example
3 | Concepts 7-9; write one SLI/SLO for a real API
4 | Concepts 10-12; Little's Law worksheet and capacity plan for a workload you have data for
5 | Concepts 13-15 and Practice 1 (performance profiling lab)
6 | Practice 2 (scaling design workshop)
7 | Practice 3 (reliability and SLO clinic)
8 | Practice 4 (katas), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread