Module 4: Scale, Reliability & Performance: Case Studies
These case studies make reliability measurable: SLOs, error budgets, latency percentiles, overload controls, incident response, and capacity limits.
Case Study 1: Error Budget Stops Feature Work
Scenario: A team has a 99.9% quarterly availability SLO but keeps shipping risky changes after burning the whole error budget in two weeks.
Source anchor: Google SRE's Error Budget Policy for Service Reliability, which explains how error budget policy balances reliability and release velocity.
Module concepts: SLI, SLO, error budget, release policy.
Wrong Approach
Treat SLOs as dashboards that nobody uses for decisions.
Better Approach
Connect budget to policy:
- If budget healthy:
  - normal release velocity
- If budget burn high:
  - freeze risky launches
  - prioritize reliability work
- If budget exhausted:
  - leadership review before launches
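To make the policy above concrete, here is a minimal sketch that turns the scenario's 99.9% quarterly SLO into a budget in minutes and maps budget state to the three release consequences. The names `SLO` and `release_policy`, the 90-day window, and the `high_burn_rate` threshold of 2.0 are assumptions for illustration; they do not come from the Google SRE policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    target: float      # e.g. 0.999 for 99.9% availability
    window_days: int   # assumed 90 days for a quarterly window

    def budget_minutes(self) -> float:
        """Total allowed 'bad' minutes in the window."""
        return (1.0 - self.target) * self.window_days * 24 * 60


def release_policy(slo: SLO, bad_minutes: float, elapsed_days: float,
                   high_burn_rate: float = 2.0) -> str:
    """Map budget state to a release consequence.

    high_burn_rate is an illustrative threshold, not a recommended value.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    budget = slo.budget_minutes()
    if bad_minutes >= budget:
        return "leadership review before launches"
    # Burn rate: share of budget consumed divided by share of window elapsed.
    # A burn rate of 1.0 means the budget lasts exactly the whole window.
    burn_rate = (bad_minutes / budget) / (elapsed_days / slo.window_days)
    if burn_rate >= high_burn_rate:
        return "freeze risky launches, prioritize reliability work"
    return "normal release velocity"


quarterly = SLO(target=0.999, window_days=90)
print(round(quarterly.budget_minutes(), 1))  # 129.6
print(release_policy(quarterly, bad_minutes=10, elapsed_days=14))
print(release_policy(quarterly, bad_minutes=60, elapsed_days=14))
print(release_policy(quarterly, bad_minutes=130, elapsed_days=14))
```

With these inputs, two weeks into the quarter, 10 bad minutes keeps normal velocity, 60 bad minutes triggers the freeze, and 130 bad minutes exceeds the 129.6-minute budget and requires leadership review.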
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no SLO | flexible | no reliability contract |
| SLO without policy | visible numbers | no behavior change |
| budget policy | balances speed/reliability | uncomfortable prioritization |
| overly strict SLO | high reliability | slow delivery and high cost |
Required Artifact
Write an SLO policy: SLI, target, window, budget, burn triggers, and release consequences.
Case Study 2: Average Latency Hides A Tail Outage
Scenario: Average latency is 120 ms. p99 is 4 seconds because a small percentage of requests block on a slow dependency.
Source anchor: Google SRE's Monitoring Distributed Systems, which names the four golden signals: latency, traffic, errors, and saturation.
Module concepts: percentiles, golden signals, RED/USE, dependency latency.
Wrong Approach
Report average latency as the performance story.
Better Approach
Track distributions and segment by route/dependency:
- latency:
  - p50, p90, p95, p99
- traffic:
  - requests/sec
- errors:
  - rate by class
- saturation:
  - queue depth, CPU, connection pool
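The gap between the average and the tail can be reproduced with a few lines. The sketch below uses synthetic data (980 requests at 40 ms and 20 requests at 4,000 ms, chosen to resemble the scenario rather than taken from a real system) and a hypothetical `percentile` helper that implements the nearest-rank method.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 980 fast requests and 20 that block on a slow dependency.
latencies_ms = [40] * 980 + [4000] * 20

print(f"mean = {mean(latencies_ms):.1f} ms")   # mean = 119.2 ms
for p in (50, 90, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50, p90, and p95 are all 40 ms; p99 is 4000 ms
```

Only 2% of requests are slow, so the mean stays near 120 ms and even p95 looks healthy. The problem shows up only at p99, which is why the tail percentiles belong on the dashboard.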
Tradeoff Table
| Metric | Gain | Risk |
|---|---|---|
| average | simple | hides tail |
| p95/p99 | user pain visible | noisier |
| per-route | actionable | more cardinality |
| dependency timing | finds bottleneck | instrumentation work |
Required Artifact
Create a dashboard spec with golden signals, route segmentation, and alert thresholds.
Case Study 3: Retry Storm Under Overload
Scenario: A dependency slows down. Callers retry immediately and multiply load until the system collapses.
Source anchor: the Amazon Builders' Library article Timeouts, retries, and backoff with jitter, which covers retries, overload, backoff, and jitter.
Module concepts: overload, retry amplification, backoff, jitter, load shedding.
Wrong Approach
Retry all failures immediately.
Better Approach
Set a retry budget:
- timeout:
  - based on caller deadline
- retries:
  - max 2 attempts
  - exponential backoff
  - jitter
- overload:
  - load shed or degrade
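The timeout and retry parts of this policy could be sketched as follows. This is one possible reading, not a reference implementation: `RetryableError` and `call_with_retries` are hypothetical names, "max 2 attempts" is interpreted as at most two retries after the first call, and the delay values are placeholders. The jitter variant is "full jitter", where each wait is a random time between zero and the capped exponential backoff.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(operation, *, deadline_s, max_retries=2,
                      base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Call operation(timeout_s) with capped exponential backoff and full jitter.

    Gives up when retries are exhausted or the caller's deadline would be exceeded.
    """
    start = clock()
    attempt = 0
    while True:
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            raise TimeoutError("caller deadline exceeded")
        try:
            # The per-call timeout is whatever is left of the caller's deadline.
            return operation(remaining)
        except RetryableError:
            if attempt >= max_retries:
                raise
            # Full jitter: random wait in [0, min(cap, base * 2^attempt)).
            backoff = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay = rng() * backoff
            if delay >= deadline_s - (clock() - start):
                raise  # not enough deadline left to wait and retry
            sleep(delay)
            attempt += 1


# Usage: a dependency that fails twice, then succeeds on the third call.
outcomes = iter([RetryableError(), RetryableError(), "ok"])


def flaky(timeout_s):
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


print(call_with_retries(flaky, deadline_s=5.0))  # ok
```

Passing `sleep`, `clock`, and `rng` as parameters is a design choice that keeps the function testable without real waiting. Load shedding is a server-side control and is not shown here.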
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| no retries | simple | transient failures leak |
| immediate retries | quick for rare blips | overload amplification |
| backoff + jitter | smooths load | slower recovery |
| load shedding | protects core | rejected requests |
Required Artifact
Write a timeout/retry/load-shed policy for one dependency.
Case Study 4: Blameless Postmortem That Produces Real Work
Scenario: An incident review spends 45 minutes asking who deployed the change and 5 minutes asking why guardrails failed.
Source anchor: Atlassian's How to run a blameless postmortem, which frames postmortems around learning and improvement without fear.
Module concepts: incident lifecycle, blamelessness, corrective action, learning system.
Wrong Approach
Find the person who made the mistake.
Better Approach
Analyze the system:
- Detection:
  - why did alert fire late?
- Mitigation:
  - why was rollback slow?
- Prevention:
  - what guardrail would catch this class?
- Learning:
  - owner, due date, verification
Tradeoff Table
| Approach | Gain | Cost |
|---|---|---|
| blame | emotional closure | hides real causes |
| blameless review | learning | requires discipline |
| action items | improvement | follow-up ownership |
| no review | saves time | repeat incidents |
Required Artifact
Write a postmortem with timeline, contributing factors, impact, what went well, what went poorly, and verified follow-ups.
Case Study 5: Capacity Plan That Ignores Saturation
Scenario: A service handles 1,000 req/s on one instance. The team assumes ten instances handle 10,000 req/s. At 6,000 req/s, database connections and queueing dominate.
Source anchor: Neil Gunther's Universal Scalability Law, a standard model of scalability limits caused by contention and coherency. See the Universal Scalability Law resources.
Module concepts: capacity planning, saturation, contention, queueing, horizontal scaling limit.
Wrong Approach
Assume throughput scales linearly with instances.
Better Approach
Measure bottlenecks:
- per-instance capacity:
- shared DB bottleneck:
- connection pool:
- queue depth:
- cache hit ratio:
- peak/headroom:
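The Universal Scalability Law gives a way to see why the linear assumption fails. It models throughput at N instances as X(N) = X(1) · N / (1 + α(N − 1) + βN(N − 1)), where α captures contention and β captures coherency cost. The sketch below applies it to the scenario's 1,000 req/s per instance. The coefficients α = 0.05 and β = 0.002 are illustrative assumptions, not measurements; in practice they are fitted to load-test data.

```python
import math


def usl_throughput(n, single_instance_rps, alpha, beta):
    """Universal Scalability Law: X(N) = X(1) * N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return single_instance_rps * n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def peak_instances(alpha, beta):
    """Instance count at which modeled throughput peaks: sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)


ALPHA, BETA = 0.05, 0.002  # assumed for illustration, not measured

for n in (1, 2, 5, 10, 20):
    linear = 1000 * n
    modeled = usl_throughput(n, 1000, ALPHA, BETA)
    print(f"{n:>2} instances: linear {linear:>6} req/s, USL {modeled:>7.0f} req/s")

print(f"modeled throughput peaks near {peak_instances(ALPHA, BETA):.1f} instances")
```

With these assumed coefficients, ten instances deliver roughly 6,135 req/s instead of 10,000, and adding instances beyond about 22 lowers modeled throughput.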
Tradeoff Table
| Choice | Gain | Cost |
|---|---|---|
| add instances | fast app scaling | shifts bottleneck |
| shard DB | more write capacity | operational complexity |
| cache | reduces backend load | invalidation/staleness |
| admission control | protects latency | rejects excess |
Required Artifact
Create a capacity worksheet with current load, peak multiplier, bottleneck, headroom, and scale trigger.
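One row of such a worksheet could be computed as in the sketch below. The `CapacityRow` class, its field names, and every number in the example (3,000 req/s current load, a 1.5x peak multiplier, a 70% scale trigger) are hypothetical; only the 6,000 req/s bottleneck figure echoes the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityRow:
    current_rps: float
    peak_multiplier: float
    bottleneck: str
    bottleneck_capacity_rps: float
    scale_trigger_utilization: float  # fraction of bottleneck capacity

    @property
    def peak_rps(self) -> float:
        return self.current_rps * self.peak_multiplier

    @property
    def peak_utilization(self) -> float:
        return self.peak_rps / self.bottleneck_capacity_rps

    @property
    def headroom(self) -> float:
        """Fraction of bottleneck capacity still unused at peak."""
        return 1.0 - self.peak_utilization

    def should_scale(self) -> bool:
        return self.peak_utilization >= self.scale_trigger_utilization


row = CapacityRow(
    current_rps=3000,
    peak_multiplier=1.5,
    bottleneck="database connections",
    bottleneck_capacity_rps=6000,
    scale_trigger_utilization=0.7,
)
print(row.peak_rps)        # 4500.0
print(row.headroom)        # 0.25
print(row.should_scale())  # True
```

The capacity figure comes from the measured bottleneck, not from instance count multiplied by per-instance throughput, which is the point of the case study.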
Source Map
| Source | Use it for |
|---|---|
| Google SRE: Error Budget Policy | connecting SLOs to release policy |
| Google SRE: Monitoring Distributed Systems | four golden signals and monitoring discipline |
| AWS Builders' Library: Timeouts and retries | backoff, jitter, overload prevention |
| Atlassian: Blameless postmortems | incident review practice |
| Universal Scalability Law | contention/coherency limits on scaling |
Completion Standard
- At least three artifacts are completed.
- At least one SLO policy includes budget consequences.
- At least one dashboard uses percentiles.
- At least one incident review has verified follow-ups.