Module 4: Scale, Reliability & Performance: Case Studies

These case studies make reliability measurable: SLOs, error budgets, latency percentiles, overload controls, incident response, and capacity limits.

Case Study 1: Error Budget Stops Feature Work

Scenario: A team has a 99.9% quarterly availability SLO but keeps shipping risky changes after burning the whole error budget in two weeks.

Source anchor: Google SRE's Error Budget Policy for Service Reliability, which explains how error budget policy balances reliability and release velocity.

Module concepts: SLI, SLO, error budget, release policy.

Wrong Approach

Treat SLOs as dashboard numbers that nobody uses for decisions.

Better Approach

Connect budget to policy:

If budget healthy:
  normal release velocity

If budget burn high:
  freeze risky launches
  prioritize reliability work

If budget exhausted:
  leadership review before launches

Tradeoff Table

Choice	Gain	Cost
no SLO	flexible	no reliability contract
SLO without policy	visible numbers	no behavior change
budget policy	balances speed/reliability	uncomfortable prioritization
overly strict SLO	high reliability	slow delivery and high cost

Required Artifact

Write an SLO policy: SLI, target, window, budget, burn triggers, and release consequences.

Case Study 2: Average Latency Hides A Tail Outage

Scenario: Average latency is 120 ms. p99 is 4 seconds because a small percentage of requests block on a slow dependency.

Source anchor: Google SRE's Monitoring Distributed Systems, which names the four golden signals: latency, traffic, errors, and saturation.

Module concepts: percentiles, golden signals, RED/USE, dependency latency.

Wrong Approach

Report average latency as the performance story.

Better Approach

Track distributions and segment by route/dependency:

latency:
  p50, p90, p95, p99

traffic:
  requests/sec

errors:
  rate by class

saturation:
  queue depth, CPU, connection pool

Tradeoff Table

Metric	Gain	Risk
average	simple	hides tail
p95/p99	user pain visible	noisier
per-route	actionable	more cardinality
dependency timing	finds bottleneck	instrumentation work

Required Artifact

Create a dashboard spec with golden signals, route segmentation, and alert thresholds.

Case Study 3: Retry Storm Under Overload

Scenario: A dependency slows down. Callers retry immediately and multiply load until the system collapses.

Source anchor: Amazon Builders' Library's Timeouts, retries, and backoff with jitter, which explains how retries interact with overload and how backoff and jitter reduce the damage.

Module concepts: overload, retry amplification, backoff, jitter, load shedding.

Wrong Approach

Retry all failures immediately.

Better Approach

Set a retry budget:

timeout:
  based on caller deadline

retries:
  max 2 attempts
  exponential backoff
  jitter

overload:
  load shed or degrade

Tradeoff Table

Choice	Gain	Cost
no retries	simple	transient failures leak
immediate retries	quick for rare blips	overload amplification
backoff + jitter	smooths load	slower recovery
load shedding	protects core	rejected requests

Required Artifact

Write timeout/retry/load-shed policy for one dependency.

Case Study 4: Blameless Postmortem That Produces Real Work

Scenario: An incident review spends 45 minutes asking who deployed the change and 5 minutes asking why guardrails failed.

Source anchor: Atlassian's How to run a blameless postmortem, which frames postmortems around learning and improvement without fear.

Module concepts: incident lifecycle, blamelessness, corrective action, learning system.

Wrong Approach

Find the person who made the mistake.

Better Approach

Analyze the system:

Detection:
  why did the alert fire late?

Mitigation:
  why was rollback slow?

Prevention:
  what guardrail would catch this class?

Learning:
  owner, due date, verification

Tradeoff Table

Approach	Gain	Cost
blame	emotional closure	hides real causes
blameless review	learning	requires discipline
action items	improvement	follow-up ownership
no review	saves time	repeat incidents

Required Artifact

Write a postmortem with timeline, contributing factors, impact, what went well, what went poorly, and verified follow-ups.

Case Study 5: Capacity Plan That Ignores Saturation

Scenario: A service handles 1,000 req/s on one instance. The team assumes ten instances handle 10,000 req/s. At 6,000 req/s, database connections and queueing dominate.

Source anchor: Neil Gunther's Universal Scalability Law, a standard model of how contention and coherency costs limit scalability. See Universal Scalability Law resources.

Module concepts: capacity planning, saturation, contention, queueing, horizontal scaling limit.

Wrong Approach

Assume throughput scales linearly with instances.

Better Approach

Measure bottlenecks:

per instance capacity:
shared DB bottleneck:
connection pool:
queue depth:
cache hit ratio:
peak/headroom:
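Once bottlenecks are measured, the Universal Scalability Law gives a way to compare the linear assumption against a model with contention and coherency costs. The sketch below uses the standard form X(N) = λN / (1 + σ(N − 1) + κN(N − 1)). The coefficient values are hypothetical placeholders; in practice they are fitted to load-test measurements, not guessed.

```python
import math


def usl_throughput(
    n: int,
    single_instance_rps: float,
    contention: float,
    coherency: float,
) -> float:
    """Modeled throughput at n instances under the Universal Scalability Law.

    contention (sigma): serialized fraction, e.g. a shared DB or lock.
    coherency (kappa): cost of keeping instances consistent with each other.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    denominator = 1 + contention * (n - 1) + coherency * n * (n - 1)
    return single_instance_rps * n / denominator


def peak_instances(contention: float, coherency: float) -> float:
    """Instance count at which modeled throughput peaks: sqrt((1 - sigma) / kappa)."""
    if coherency <= 0:
        raise ValueError("coherency must be positive for a finite peak")
    return math.sqrt((1 - contention) / coherency)


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only.
    sigma, kappa = 0.05, 0.005
    for n in (1, 5, 10, 14, 20):
        linear = 1000 * n
        modeled = usl_throughput(n, 1000, sigma, kappa)
        print(f"{n:>2} instances: linear {linear:>6} req/s, USL {modeled:>7.0f} req/s")
    print(f"modeled peak near {peak_instances(sigma, kappa):.1f} instances")
```

With these placeholder coefficients, ten instances deliver a little over half of the 10,000 req/s that the linear assumption promises, and adding instances past the peak lowers throughput. The exact numbers depend entirely on the fitted coefficients.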

Tradeoff Table

Choice	Gain	Cost
add instances	fast app scaling	shifts bottleneck
shard DB	more write capacity	operational complexity
cache	reduces backend load	invalidation/staleness
admission control	protects latency	rejects excess

Required Artifact

Create a capacity worksheet with current load, peak multiplier, bottleneck, headroom, and scale trigger.

Source Map

Source	Use it for
Google SRE: Error Budget Policy	connecting SLOs to release policy
Google SRE: Monitoring Distributed Systems	four golden signals and monitoring discipline
AWS Builders' Library: Timeouts and retries	backoff, jitter, overload prevention
Atlassian: Blameless postmortems	incident review practice
Universal Scalability Law	contention/coherency limits on scaling

Completion Standard

At least three artifacts are completed.
At least one SLO policy includes budget consequences.
At least one dashboard uses percentiles.
At least one incident review has verified follow-ups.
