Module 4: Scale, Reliability & Performance: Case Studies

These case studies make reliability measurable: SLOs, error budgets, latency percentiles, overload controls, incident response, and capacity limits.


Case Study 1: Error Budget Stops Feature Work

Scenario: A team has a 99.9% quarterly availability SLO but keeps shipping risky changes after burning the whole error budget in two weeks.

Source anchor: Google SRE's Error Budget Policy for Service Reliability, which explains how error budget policy balances reliability and release velocity.

Module concepts: SLI, SLO, error budget, release policy.

Wrong Approach

Treat SLOs as dashboards that nobody uses for decisions.

Better Approach

Connect budget to policy:

If budget healthy:
  normal release velocity

If budget burn high:
  freeze risky launches
  prioritize reliability work

If budget exhausted:
  leadership review before launches
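The policy above can be sketched as a small decision function. This is an illustration under stated assumptions, not a prescribed implementation: availability is measured in minutes, "quarterly" is approximated as a 90-day window, and the names `Slo`, `budget_minutes`, `release_decision`, and the `HIGH_BURN_RATE` threshold of 2.0 are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold: consuming budget more than 2x faster than the
# window allows counts as "high burn". Tune this per service.
HIGH_BURN_RATE = 2.0


@dataclass(frozen=True)
class Slo:
    target: float      # availability target, e.g. 0.999
    window_days: int   # SLO window; 90 days stands in for "quarterly"


def budget_minutes(slo: Slo) -> float:
    """Total unavailable minutes the SLO allows over its window."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


def release_decision(slo: Slo, bad_minutes: float, elapsed_days: float) -> str:
    """Map budget consumption to the three policy states."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    consumed = bad_minutes / budget_minutes(slo)
    if consumed >= 1.0:
        return "exhausted: leadership review before launches"
    # A burn rate of 1.0 means the budget lasts exactly the full window.
    burn_rate = consumed / (elapsed_days / slo.window_days)
    if burn_rate > HIGH_BURN_RATE:
        return "high burn: freeze risky launches, prioritize reliability work"
    return "healthy: normal release velocity"


slo = Slo(target=0.999, window_days=90)
print(round(budget_minutes(slo), 1))  # about 129.6 minutes per quarter
print(release_decision(slo, bad_minutes=130, elapsed_days=14))
```

With these assumptions, the scenario's team (budget gone in two weeks) lands in the "exhausted" state, so the next launch needs a leadership review rather than a routine approval.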

Tradeoff Table

Choice             | Gain                       | Cost
no SLO             | flexible                   | no reliability contract
SLO without policy | visible numbers            | no behavior change
budget policy      | balances speed/reliability | uncomfortable prioritization
overly strict SLO  | high reliability           | slow delivery and high cost

Required Artifact

Write an SLO policy: SLI, target, window, budget, burn triggers, and release consequences.


Case Study 2: Average Latency Hides A Tail Outage

Scenario: Average latency is 120 ms. p99 is 4 seconds because a small percentage of requests block on a slow dependency.

Source anchor: Google SRE's Monitoring Distributed Systems, which names the four golden signals: latency, traffic, errors, and saturation.

Module concepts: percentiles, golden signals, RED/USE, dependency latency.

Wrong Approach

Report average latency as the performance story.

Better Approach

Track distributions and segment by route/dependency:

latency:
  p50, p90, p95, p99

traffic:
  requests/sec

errors:
  rate by class

saturation:
  queue depth, CPU, connection pool
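To see why the average hides the tail, the sketch below builds a synthetic latency sample and compares the mean with nearest-rank percentiles. The sample is invented for illustration (980 requests at 40 ms and 20 requests at 4,000 ms, chosen so the mean lands near the scenario's 120 ms), and `percentile` is a hypothetical helper, not a library function.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data (assumption): 2% of requests block on a slow dependency.
latencies_ms = [40] * 980 + [4000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean_ms} ms")
for p in (50, 90, 95, 99):
    print(f"p{p}={percentile(latencies_ms, p)} ms")
```

In this sample the mean is about 119 ms and p50 through p95 all sit at 40 ms, while p99 is 4,000 ms. Only the high percentile exposes the requests that block on the slow dependency.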

Tradeoff Table

Metric            | Gain              | Risk
average           | simple            | hides tail
p95/p99           | user pain visible | noisier
per-route         | actionable        | more cardinality
dependency timing | finds bottleneck  | instrumentation work

Required Artifact

Create a dashboard spec with golden signals, route segmentation, and alert thresholds.


Case Study 3: Retry Storm Under Overload

Scenario: A dependency slows down. Callers retry immediately and multiply load until the system collapses.

Source anchor: The Amazon Builders' Library article Timeouts, retries, and backoff with jitter, which covers retries, overload, backoff, and jitter.

Module concepts: overload, retry amplification, backoff, jitter, load shedding.

Wrong Approach

Retry all failures immediately.

Better Approach

Bound retries with an explicit timeout, retry, and overload policy:

timeout:
  based on caller deadline

retries:
  max 2 attempts
  exponential backoff
  jitter

overload:
  load shed or degrade
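A minimal sketch of the timeout and retry parts of this policy follows. Several things here are assumptions rather than part of the source: `TransientError` and `call_with_retries` are hypothetical names, "max 2 attempts" is read as at most two retries after the first call, the delay values are placeholders, and the jitter variant is "full jitter" (a uniform random delay between zero and the exponential cap).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(op, *, deadline_s, max_retries=2,
                      base_delay_s=0.1, max_delay_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call op(), retrying transient failures within the caller's deadline."""
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except TransientError:
            if attempt >= max_retries:
                raise  # retry limit reached: surface the failure
            # Exponential backoff cap: base, 2*base, 4*base, ... up to max.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay = random.random() * cap  # full jitter in [0, cap)
            remaining = deadline_s - (clock() - start)
            if delay >= remaining:
                raise  # waiting would exceed the caller's deadline
            sleep(delay)
            attempt += 1


# Usage: a dependency that fails twice, then recovers.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("slow dependency")
    return "ok"


print(call_with_retries(flaky, deadline_s=5.0))
```

The randomized delay spreads callers out so they do not retry in lockstep, and the deadline check keeps a retry from outliving the request that triggered it. Load shedding is a separate, server-side control and is not shown here.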

Tradeoff Table

Choice            | Gain                 | Cost
no retries        | simple               | transient failures leak
immediate retries | quick for rare blips | overload amplification
backoff + jitter  | smooths load         | slower recovery
load shedding     | protects core        | rejected requests

Required Artifact

Write a timeout/retry/load-shed policy for one dependency.


Case Study 4: Blameless Postmortem That Produces Real Work

Scenario: An incident review spends 45 minutes asking who deployed the change and 5 minutes asking why guardrails failed.

Source anchor: Atlassian's How to run a blameless postmortem, which frames postmortems around learning and improvement without fear.

Module concepts: incident lifecycle, blamelessness, corrective action, learning system.

Wrong Approach

Find the person who made the mistake.

Better Approach

Analyze the system:

Detection:
  why did the alert fire late?

Mitigation:
  why was the rollback slow?

Prevention:
  what guardrail would catch this class?

Learning:
  owner, due date, verification

Tradeoff Table

Approach         | Gain              | Cost
blame            | emotional closure | hides real causes
blameless review | learning          | requires discipline
action items     | improvement       | follow-up ownership
no review        | saves time        | repeat incidents

Required Artifact

Write a postmortem with timeline, contributing factors, impact, what went well, what went poorly, and verified follow-ups.


Case Study 5: Capacity Plan That Ignores Saturation

Scenario: A service handles 1,000 req/s on one instance. The team assumes ten instances handle 10,000 req/s. At 6,000 req/s, database connections and queueing dominate.

Source anchor: Neil Gunther's Universal Scalability Law, a standard model of how contention and coherency costs limit scalability. See Universal Scalability Law resources.

Module concepts: capacity planning, saturation, contention, queueing, horizontal scaling limit.

Wrong Approach

Assume throughput scales linearly with instances.

Better Approach

Measure bottlenecks:

per instance capacity:
shared DB bottleneck:
connection pool:
queue depth:
cache hit ratio:
peak/headroom:
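Once measurements exist, the Universal Scalability Law gives a quick check on the linear-scaling assumption. In its usual form, throughput at N instances is X(N) = X(1) * N / (1 + alpha * (N - 1) + beta * N * (N - 1)), where alpha models contention and beta models coherency cost. The sketch below uses illustrative coefficients (alpha = 0.04, beta = 0.003) that are assumptions, not values fitted to the scenario; in practice they would be fitted from load-test data. The function names are hypothetical.

```python
import math


def usl_throughput(n, single_instance_rps, alpha, beta):
    """Predicted throughput at n instances under the Universal Scalability Law."""
    return single_instance_rps * n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def peak_instances(alpha, beta):
    """Instance count where USL throughput peaks; adding more reduces it."""
    return math.sqrt((1 - alpha) / beta)


ALPHA, BETA = 0.04, 0.003  # illustrative only; fit these from measurements

for n in (1, 5, 10, 20):
    print(n, round(usl_throughput(n, 1000, ALPHA, BETA)))
print("peak near", round(peak_instances(ALPHA, BETA), 1), "instances")
```

With these example coefficients, ten instances deliver roughly 6,100 req/s rather than 10,000, and throughput peaks near 18 instances. Past that point, adding instances is predicted to lower throughput, which is the cue to address the shared bottleneck instead.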

Tradeoff Table

ChoiceGainCost
add instances     | fast app scaling      | shifts bottleneck
shard DB          | more write capacity   | operational complexity
cache             | reduces backend load  | invalidation/staleness
admission control | protects latency      | rejects excess

Required Artifact

Create a capacity worksheet with current load, peak multiplier, bottleneck, headroom, and scale trigger.


Source Map

Source                                      | Use it for
Google SRE: Error Budget Policy             | connecting SLOs to release policy
Google SRE: Monitoring Distributed Systems  | four golden signals and monitoring discipline
AWS Builders' Library: Timeouts and retries | backoff, jitter, overload prevention
Atlassian: Blameless postmortems            | incident review practice
Universal Scalability Law                   | contention/coherency limits on scaling

Completion Standard

  • At least three artifacts are completed.
  • At least one SLO policy includes budget consequences.
  • At least one dashboard uses percentiles.
  • At least one incident review has verified follow-ups.