Book Exercise Lanes

This module's exercise system is book-driven. Use the lanes below for targeted practice volume after you have already learned the concept from the guide.

How To Use This Page

Finish the relevant concept page first.
Solve at least one problem of your own from memory.
Only then open the matching exercise lane.
Keep a mistake log with tags such as averaged percentiles, missing timeout, no idempotency key, SLO without burn-rate alert, cache stampede, sticky session assumed, mitigation confused with resolution, blameful postmortem.

Lane 1: Performance Reasoning

Use this lane when percentile reasoning, USE/RED dashboards, or scalability-law math are still effortful.

External calibration:

Brendan Gregg: The USE Method - memorize the checklist for at least two resource classes.
The Tail at Scale (Dean, Barroso) - read section 2 and reproduce the fan-out tail formula.
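The fan-out tail formula can be sanity-checked in a few lines. This is a minimal sketch that assumes backend calls are independent and that the parent request waits for every one of them; the function name is hypothetical.

```python
def fanout_tail_probability(per_call_fast_fraction: float, fanout: int) -> float:
    """Probability that at least one of `fanout` parallel calls is slow.

    Assumption: calls are independent and the parent waits for all of them.
    `per_call_fast_fraction` is the share of single calls that finish under
    the latency threshold, e.g. 0.99 when the threshold is the per-server p99.
    """
    if not 0.0 <= per_call_fast_fraction <= 1.0:
        raise ValueError("per_call_fast_fraction must be between 0 and 1")
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    # All calls are fast with probability p^N, so at least one is slow
    # with probability 1 - p^N.
    return 1.0 - per_call_fast_fraction ** fanout


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(fanout_tail_probability(0.99, n), 3))
```

With a per-server p99 threshold and a fan-out of 100, roughly 63% of parent requests hit at least one slow call, which is why a tail that looks rare per server dominates user-facing latency.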

Target outcomes:

3 percentile drills on sample distributions (compute p50/p95/p99, explain to a peer why averages do not tell you p99).
1 USE dashboard design and 1 RED dashboard design for two different real services you have access to.
1 Amdahl's-Law + USL calculation worksheet for a proposed scale-out, with the peak N (the node count beyond which throughput falls) identified.
1 written refutation (in your own words) of a teammate's "our average is fine" claim, using real production data.
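A percentile drill can start from a tiny synthetic distribution like the one below. This is a sketch only: the sample data is invented for illustration, and the nearest-rank definition is one of several percentile conventions, so results on real data may differ slightly from what your metrics backend reports.

```python
import math
import statistics


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Invented sample: 98 fast requests at 10 ms and 2 slow ones at 1000 ms.
latencies_ms = [10] * 98 + [1000] * 2

if __name__ == "__main__":
    print("mean:", statistics.mean(latencies_ms))  # 29.8
    print("p50: ", percentile(latencies_ms, 50))   # 10
    print("p95: ", percentile(latencies_ms, 95))   # 10
    print("p99: ", percentile(latencies_ms, 99))   # 1000
```

The mean of 29.8 ms looks healthy, yet one request in fifty takes a full second. That gap is the core of any refutation of "our average is fine".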

Lane 2: Scaling Strategies

Use this lane when you can quote definitions but struggle to defend a specific scaling decision under pushback.

Target outcomes:

1 end-to-end stateful-component audit of a real application (every hop labeled stateless/sticky/external/store).
1 three-layer cache design (CDN + reverse proxy + app) for a specific product, with update pattern and invalidation plan for each layer.
1 thundering-herd avoidance plan for a hot key in a real cache.
1 written SWOT of sticky sessions vs external session store for a specific team's use case.
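One common ingredient of a thundering-herd plan is request coalescing (often called single-flight): when a hot key is missing, one caller loads it and the rest wait for that result. The sketch below is an in-process illustration under stated assumptions, not a production cache: the class and function names are hypothetical, there is no TTL or eviction, and a distributed cache would need a shared lock or lease instead of a thread lock.

```python
import threading
import time
from typing import Any, Callable, Dict, Hashable


class SingleFlightCache:
    """In-memory cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader: Callable[[Hashable], Any]) -> None:
        self._loader = loader
        self._values: Dict[Hashable, Any] = {}
        self._inflight: Dict[Hashable, threading.Event] = {}
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> Any:
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)  # only the leader hits the backend
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    with self._lock:
                        del self._inflight[key]
                    event.set()  # wake followers whether the load succeeded or failed
            else:
                event.wait()  # follower: wait, then re-check the cache on the next loop


def demo(n_threads: int = 20):
    """Fire n_threads concurrent reads at one cold key; return (backend calls, results)."""
    calls, results = [], []

    def slow_loader(key):
        calls.append(key)
        time.sleep(0.05)  # simulated slow backend
        return f"value-for-{key}"

    cache = SingleFlightCache(slow_loader)
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("hot-key")))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

If the leader's load raises, followers wake up, find no cached value, and one of them becomes the new leader. A written plan should also say what happens at expiry, for example jittered TTLs or refresh-ahead, which this sketch leaves out.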

Lane 3: Reliability Engineering

Use this lane when SLOs feel abstract, error budgets feel theoretical, or "blast radius" is a slogan.

External ladders:

Target outcomes:

3 SLI/SLO docs for 3 different real APIs, each with explicit burn-rate alerts.
1 classification exercise: given 10 real postmortems, label each as cascading, correlated, or gray failure.
1 chaos-experiment proposal with a hypothesis, blast-radius controls, and abort conditions.
1 error-budget policy written for a real team: when does the team stop shipping features?
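The arithmetic behind a burn-rate alert is small enough to sketch. The example below assumes a 30-day SLO period and uses the commonly cited fast-burn threshold of 14.4 over a one-hour window (about 2% of the budget). The function names and default values are illustrative assumptions, not settings from any particular monitoring tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing.

    error_ratio: observed bad-event fraction in the window (e.g. 0.02).
    slo_target:  SLO as a fraction (e.g. 0.999), so the budget is 1 - slo_target.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the whole period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both the long and the short window burn fast.

    The short window keeps the alert from firing long after the problem stopped.
    """
    return (burn_rate(long_window_ratio, slo_target) > threshold
            and burn_rate(short_window_ratio, slo_target) > threshold)


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(round(rate, 2), round(budget_consumed(rate, window_hours=1), 3))
```

An SLO doc with "explicit burn-rate alerts" should state the threshold, both windows, and the budget fraction each alert corresponds to, so a reader can redo this calculation.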

Lane 4: Capacity Planning and Load

Use this lane when Little's Law still requires a calculator or back-pressure feels handwavy.
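Little's Law, L = λ × W, links the average number of items in a system (L), the average arrival rate (λ), and the average time each item spends in the system (W). The helper names below are hypothetical, and the law holds for long-run averages of a stable system, so expect measured values to differ over short windows.

```python
def items_in_system(arrival_rate: float, time_in_system: float) -> float:
    """L = lambda * W."""
    return arrival_rate * time_in_system


def arrival_rate(items: float, time_in_system: float) -> float:
    """lambda = L / W."""
    return items / time_in_system


def time_in_system(items: float, rate: float) -> float:
    """W = L / lambda."""
    return items / rate


if __name__ == "__main__":
    # 50 requests/s arriving, each spending 0.2 s in the service:
    # on average about 10 requests are in flight.
    print(items_in_system(50, 0.2))
    # 10 busy workers and 0.2 s per request: sustainable rate is about 50 requests/s.
    print(arrival_rate(10, 0.2))
```

Keep the units consistent: if λ is per second, W must be in seconds.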

External ladders:

Neil Gunther: Universal Scalability Law - estimate α and β from a real measurement.
SRE Book (Site Reliability Engineering): Addressing Cascading Failures - the sections on load shedding, timeouts, and retries.
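Once α (contention) and β (coherency) are estimated, the Universal Scalability Law gives relative capacity C(N) = N / (1 + α(N - 1) + βN(N - 1)), and the peak sits at N* = sqrt((1 - α) / β). The sketch below only evaluates the model; the coefficient values in the example are invented, and fitting α and β to real measurements is a separate regression step not shown here. Setting β = 0 reduces the formula to Amdahl's Law.

```python
import math


def usl_capacity(n: int, alpha: float, beta: float) -> float:
    """Relative capacity at n nodes, normalized so that C(1) = 1."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak_n(alpha: float, beta: float) -> float:
    """Node count where capacity peaks; beyond it, adding nodes reduces throughput."""
    if beta <= 0:
        raise ValueError("with beta = 0 (pure Amdahl) capacity never peaks")
    return math.sqrt((1 - alpha) / beta)


if __name__ == "__main__":
    alpha, beta = 0.05, 0.001  # invented coefficients for illustration
    print("peak N ~", round(usl_peak_n(alpha, beta), 1))
    for n in (1, 8, 31, 60):
        print(n, round(usl_capacity(n, alpha, beta), 2))
```

In this invented case the peak is near 31 nodes, and capacity at 60 nodes is lower than at 31, which is the retrograde region a scale-out worksheet should flag.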

Target outcomes:

3 Little's-Law calculations on real or simulated queues, with L = λ × W compared to measurement.
1 rate-limiter design (token bucket or leaky bucket) with request path, storage, and degraded-mode plan.
1 capacity-plan worksheet: forecast 6 months of growth for a service, with instance counts at current, forecast-linear, and worst-case geometric growth.
1 back-pressure design for a worker pool: at what utilization do you shed, what do you return, how do you log?
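For the rate-limiter exercise, a token bucket can be prototyped in a few lines before the harder questions (shared storage, degraded mode) are designed. This is a single-process sketch under stated assumptions: the class name and parameters are hypothetical, it is not thread-safe, and a multi-instance service would need shared state that is out of scope here.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock            # injectable so tests can control time
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if available, else False (shed the request)."""
        now = self._clock()
        # Refill lazily based on elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    fake_now = [0.0]
    bucket = TokenBucket(rate_per_sec=2, capacity=3, clock=lambda: fake_now[0])
    print([bucket.allow() for _ in range(4)])  # burst of 3 allowed, 4th rejected
    fake_now[0] = 0.5                          # half a second later: one token refilled
    print(bucket.allow(), bucket.allow())
```

A full design should also say what a rejected caller receives (for example HTTP 429 with a retry hint) and what the limiter does when its storage is unavailable.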

Lane 5: Incident and Observability

Use this lane when you know the phases but your postmortems still come out blameful or your dashboards lie to you.

External ladders:

Target outcomes:

1 instrumentation audit: for one real service, list every metric, every structured log field, and every trace span. Identify gaps.
2 tabletop incident runs (from Kata 4): play IC in writing, name mitigation separately from resolution.
2 blameless postmortem rewrites of public incidents: take a published blameful narrative and write the systemic version.
1 action-item retrospective: pick a real team's last 10 postmortem action items; audit how many shipped.

Self-Curated Problem Set

Build a custom set with these minimums:

3 percentile drills on real data.
3 SLI/SLO designs for real APIs.
3 Little's-Law calculations on real queues.
3 cache designs for different read/write mixes (read-heavy, write-heavy, mixed).
2 rate-limiter designs for different constraints.
2 chaos experiments with full hypothesis + blast radius + abort.
2 tabletop incidents with written timelines.
2 blameless postmortem rewrites.

Completion Checklist

Completed at least one lane in full.
Logged at least 12 real mistakes and corrections.
Can compute any of L, λ, W from the other two without a calculator for small numbers.
Can draft an SLO doc for any real API in under 30 minutes.
Can name the dominant failure-mode category of an incident within a minute of hearing it.
Wrote at least one blameless postmortem for a real or simulated incident.
Designed at least one rate limiter end to end.
