Book Exercise Lanes
This module's exercise system is book-driven. Use these local chunks for targeted practice volume once you have learned the concept from the guide.
How To Use This Page
- Finish the relevant concept page first.
- Solve at least one problem of your own from memory.
- Only then open the matching exercise lane.
- Keep a mistake log with tags such as averaged percentiles, missing timeout, no idempotency key, SLO without burn-rate alert, cache stampede, sticky session assumed, mitigation confused with resolution, blameful postmortem.
Lane 1: Performance Reasoning
Use this lane when percentile reasoning, USE/RED dashboards, or scalability-law math are still effortful.
- System Design Primer: Performance vs Scalability
- System Design Primer: Latency vs Throughput
- System Design Primer: Powers of Two and Latency Numbers
- FoSA: Cross-Cutting Architecture Characteristics
- FoSA: Measuring Architecture Characteristics
External calibration:
- Brendan Gregg: The USE Method - memorize the checklist for at least two resource classes.
- The Tail at Scale (Dean, Barroso) - read section 2 and reproduce the fan-out tail formula.
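The fan-out tail formula from the last item can be checked numerically. This is a minimal sketch, assuming each backend call is slow independently with the same probability; the function name and the 1% figure are illustrative choices, not taken from this guide.

```python
def prob_request_slow(p_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow.

    Assumes every call is slow with the same probability `p_slow`
    and that calls are independent of each other.
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


# A request that waits on all N backends is slow if any single backend is slow.
for n in (1, 10, 100):
    print(f"fan-out {n:>3}: P(slow) = {prob_request_slow(0.01, n):.3f}")
```

With a 1-in-100 slow call and a fan-out of 100, roughly 63% of user requests hit at least one slow backend, which is why a per-server p99 understates what users see.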
Target outcomes:
- 3 percentile drills on sample distributions (compute p50/p95/p99, explain to a peer why averages do not tell you p99).
- 1 USE dashboard design and 1 RED dashboard design for two different real services you have access to.
- 1 Amdahl's-Law + USL calculation worksheet for a proposed scale-out, with peak N identified.
- 1 written refutation (in your own words) of a teammate's "our average is fine" claim, using real production data.
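A percentile drill from the outcomes above might look like the sketch below. It uses the nearest-rank definition of a percentile (other definitions interpolate and give slightly different values), and the latency sample is made up for illustration.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 97 fast requests and 3 very slow ones.
latencies = [10.0] * 97 + [2000.0] * 3

mean = statistics.mean(latencies)
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
print(f"mean={mean:.1f} p50={p50} p95={p95} p99={p99}")
```

Here the mean is about 70 ms, p50 and p95 are both 10 ms, and p99 is 2000 ms. The average describes no request that actually happened and hides the three that hurt, which is the core of the "our average is fine" refutation.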
Lane 2: Scaling Strategies
Use this lane when you can quote definitions but struggle to defend a specific scaling decision under pushback.
- System Design Primer: Performance vs Scalability
- System Design Primer: Load Balancer
- System Design Primer: Reverse Proxy
- System Design Primer: Application Layer and Microservices
- System Design Primer: Cache Overview and Levels
- System Design Primer: Cache Update Patterns
- System Design Primer: Content Delivery Network
- System Design Primer: Availability Patterns
Target outcomes:
- 1 end-to-end stateful-component audit of a real application (every hop labeled stateless/sticky/external/store).
- 1 three-layer cache design (CDN + reverse proxy + app) for a specific product, with update pattern and invalidation plan for each layer.
- 1 thundering-herd avoidance plan for a hot key in a real cache.
- 1 written SWOT of sticky sessions vs external session store for a specific team's use case.
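For the thundering-herd plan, one common building block is request coalescing ("single flight"): concurrent misses on the same key wait for one load instead of each hitting the backend. The sketch below is an in-process illustration only; the class name, the loader, and the absence of TTLs or eviction are all simplifying assumptions, and a distributed cache would need a different locking mechanism.

```python
import threading
import time
from typing import Callable, Dict


class SingleFlightCache:
    """In-process cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader                      # hypothetical slow backend read
        self._values: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()             # protects the two dicts above

    def get(self, key: str) -> str:
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check: another thread may have filled the key while we waited.
            with self._guard:
                if key in self._values:
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value


calls = []


def slow_loader(key: str) -> str:
    calls.append(key)          # record every backend hit
    time.sleep(0.05)           # simulate an expensive read
    return key.upper()


cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("hot",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"backend loads for 20 concurrent readers: {len(calls)}")
```

Twenty readers arrive during the miss, but the loader runs once. A written plan would still need to cover expiry (for example jittered TTLs or refresh-ahead), which this sketch leaves out.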
Lane 3: Reliability Engineering
Use this lane when SLOs feel abstract, error budgets feel theoretical, or "blast radius" is a slogan.
- System Design Primer: Availability Patterns
- System Design Primer: CAP Theorem
- FoSA: Explicit Characteristics
- FoSA: Implicit Characteristics
- FoSA: Fitness Functions
External ladders:
- Google SRE Book: Service Level Objectives
- SRE Workbook: Implementing SLOs
- SRE Workbook: Alerting on SLOs
- Principles of Chaos Engineering
- Richard Cook: How Complex Systems Fail
Target outcomes:
- 3 SLI/SLO docs for 3 different real APIs, each with explicit burn-rate alerts.
- 1 classification exercise: given 10 real postmortems, label each as cascading, correlated, or gray failure.
- 1 chaos-experiment proposal with a hypothesis, blast-radius controls, and abort conditions.
- 1 error-budget policy written for a real team: when does the team stop shipping features?
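The burn-rate arithmetic behind the SLO docs above can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target); the function names and the sample numbers are illustrative assumptions.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


slo = 0.999                      # hypothetical 99.9% availability target
slo_window = 30 * 24             # 30-day SLO window, in hours

# Page if 2% of the 30-day budget would be gone within 1 hour.
threshold = burn_rate_threshold(0.02, 1, slo_window)
observed = burn_rate(error_ratio=0.02, slo_target=slo)   # 2% of requests failing

print(f"threshold={threshold:.1f} observed={observed:.1f} page={observed >= threshold}")
```

Spending 2% of a 30-day budget in one hour corresponds to a burn rate of 14.4, and a 2% error ratio against a 99.9% target burns at about 20, so this alert would fire. Production alerts usually pair a long window with a short one to cut false positives; that refinement is omitted here.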
Lane 4: Capacity Planning and Load
Use this lane when Little's Law still requires a calculator or back-pressure feels handwavy.
- System Design Primer: Latency vs Throughput
- System Design Primer: Asynchronism
- System Design Primer: Load Balancer
- System Design Primer: Availability Patterns
- System Design Primer: Powers of Two and Latency Numbers
- FoSA: Measuring Architecture Characteristics
External ladders:
- Neil Gunther: Universal Scalability Law - estimate α and β from a real measurement.
- Google SRE Book: Addressing Cascading Failures - the chapter covering load shedding, timeouts, and retries.
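Estimating α and β for the USL item can be done with a small least-squares fit. The sketch below assumes throughput has been normalized so that relative capacity is 1.0 at N = 1, fits the linearized form N/C − 1 = α(N − 1) + βN(N − 1), and uses synthetic points rather than real measurements. All function names are illustrative.

```python
import math
from typing import List, Tuple


def usl_capacity(n: float, alpha: float, beta: float) -> float:
    """Relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def fit_usl(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares estimate of (alpha, beta).

    points: (N, C) pairs, where C is throughput at N divided by throughput at N=1.
    """
    s11 = s12 = s22 = s1y = s2y = 0.0
    for n, c in points:
        x1, x2, y = n - 1, n * (n - 1), n / c - 1
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        s1y += x1 * y
        s2y += x2 * y
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("need measurements at two or more distinct N > 1")
    alpha = (s1y * s22 - s2y * s12) / det
    beta = (s11 * s2y - s12 * s1y) / det
    return alpha, beta


def peak_n(alpha: float, beta: float) -> float:
    """Load at which USL throughput peaks: sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)


# Synthetic "measurements" generated from known coefficients (not real data).
true_alpha, true_beta = 0.05, 0.001
points = [(n, usl_capacity(n, true_alpha, true_beta)) for n in (1, 2, 4, 8, 16, 32)]

alpha, beta = fit_usl(points)
print(f"alpha={alpha:.4f} beta={beta:.5f} peak N ~ {peak_n(alpha, beta):.1f}")
```

With these coefficients the peak lands at roughly 31 nodes; adding capacity past that point lowers throughput. Setting β to 0 reduces the model to Amdahl's Law with serial fraction α. Real measurements are noisy, so expect the fit to be approximate and sanity-check it against a plot.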
Target outcomes:
- 3 Little's-Law calculations on real or simulated queues, with L = λ × W compared to measurement.
- 1 rate-limiter design (token bucket or leaky bucket) with request path, storage, and degraded-mode plan.
- 1 capacity-plan worksheet: forecast 6 months of growth for a service, with instance counts at current, forecast-linear, and worst-case geometric growth.
- 1 back-pressure design for a worker pool: at what utilization do you shed, what do you return, how do you log?
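As a starting point for the rate-limiter design above, a token bucket can be sketched in a few lines. This version is single-process and not thread-safe, and it takes an injectable clock so the behavior can be checked deterministically. The class and parameter names are illustrative; shared storage and a degraded-mode plan remain part of the exercise.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity        # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock (a one-element list acts as mutable time).
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(6)]        # 5 allowed, 6th rejected
fake_now[0] += 1.0                                # one second passes: 2 tokens refill
after_wait = [bucket.allow() for _ in range(3)]   # 2 allowed, 3rd rejected
print(burst, after_wait)
```

A rejected call is where the design decisions start: what status the caller gets back, whether it is told when to retry, and what happens when the limiter's own storage is unavailable.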
Lane 5: Incident and Observability
Use this lane when you know the phases but your postmortems still come out blameful or your dashboards lie to you.
- FoSA: Operations and DevOps
- FoSA: Cross-Cutting Architecture Characteristics
- FoSA: Measuring Architecture Characteristics
- System Design Primer: Availability Patterns
External ladders:
- Google SRE Book: Monitoring Distributed Systems
- Google SRE Book: Managing Incidents
- Google SRE Book: Emergency Response
- Google SRE Book: Postmortem Culture
- PagerDuty Incident Response
Target outcomes:
- 1 instrumentation audit: for one real service, list every metric, every structured log field, and every trace span. Identify gaps.
- 2 tabletop incident runs (from Kata 4): play IC in writing, name mitigation separately from resolution.
- 2 blameless postmortem rewrites of public incidents: take a published narrative that assigns blame and write the systemic version.
- 1 action-item retrospective: pick a real team's last 10 postmortem action items; audit how many shipped.
Self-Curated Problem Set
Build a custom set with these minimums:
- 3 percentile drills on real data.
- 3 SLI/SLO designs for real APIs.
- 3 Little's-Law calculations on real queues.
- 3 cache designs for different read/write mixes (read-heavy, write-heavy, mixed).
- 2 rate-limiter designs for different constraints.
- 2 chaos experiments with full hypothesis + blast radius + abort.
- 2 tabletop incidents with written timelines.
- 2 blameless postmortem rewrites.
Completion Checklist
- Completed at least one lane in full.
- Logged at least 12 real mistakes and corrections.
- Can compute any of L, λ, W from the other two without a calculator for small numbers.
- Can draft an SLO doc for any real API in under 30 minutes.
- Can name the dominant failure-mode category of an incident within a minute of hearing it.
- Wrote at least one blameless postmortem for a real or simulated incident.
- Designed at least one rate limiter end to end.
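To check mental Little's-Law answers against the checklist item above, a small helper that solves L = λ × W for whichever quantity is missing is enough. The function and parameter names are illustrative; units are assumed consistent (for example requests, requests per second, and seconds).

```python
from typing import Optional, Tuple


def littles_law(items: Optional[float] = None,
                rate: Optional[float] = None,
                wait: Optional[float] = None) -> Tuple[float, float, float]:
    """Solve L = lambda * W for the one missing quantity.

    items: L, average number in the system
    rate:  lambda, average arrival rate
    wait:  W, average time spent in the system
    Exactly two of the three must be provided.
    """
    provided = sum(v is not None for v in (items, rate, wait))
    if provided != 2:
        raise ValueError("provide exactly two of items, rate, wait")
    if items is None:
        items = rate * wait
    elif rate is None:
        rate = items / wait
    else:
        wait = items / rate
    return items, rate, wait


# 40 requests/s, each spending 0.25 s in the system: 10 in flight on average.
print(littles_law(rate=40, wait=0.25))
# 10 in flight at 40 requests/s: each spends 0.25 s in the system.
print(littles_law(items=10, rate=40))
# 10 in flight, 0.25 s each: the system is handling 40 requests/s.
print(littles_law(items=10, wait=0.25))
```

The relationship holds for long-run averages of a stable system, so a mismatch against measurement usually means the window was too short or the queue was still growing.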