Scale, Reliability, and Performance Katas
Focused, repeatable exercises to build fluency with this module's core skills. Complete each kata at least twice, ideally against different services or systems.
Kata 1: Write an SLI/SLO for a Real API
Time limit: 30 minutes per run
Goal: produce production-grade reliability targets
Setup: pick one real API you have access to (internal service, personal project, or a public one like a weather or news API). For it, produce:
- Three SLIs in strict `good_events / total_events` form:
  - Availability SLI: e.g., `count(status < 500) / count(all requests)` over `1m` buckets.
  - Latency SLI: e.g., `count(latency < 300ms AND status < 500) / count(all requests)`. Include the status filter, or fast error responses will count as good and the SLI will look healthy while the service is failing.
  - Quality or freshness SLI specific to the API (cache hit rate threshold, staleness bound, correctness sample).
- An SLO for each SLI with a measurement window (rolling 28 or 30 days).
- Error budget in minutes (availability) and percentage (latency).
- Two burn-rate alerts: a fast alert (5% of monthly budget in 1h -> page) and a slow alert (10% of monthly budget in 6h -> ticket).
- A rollback/release-gating policy: what does the team do if the budget is exhausted?
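The arithmetic behind the SLIs, error budget, and burn-rate thresholds in this list can be sketched as follows. This is a minimal sketch, not a monitoring implementation: the `Request` record, its field names, the 99.9% target, and the 30-day window are illustrative assumptions, and a real system would compute these from metrics rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; the field names are assumptions."""
    status: int
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """good_events / total_events, where good means status < 500."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Good means fast AND not a server error, so fast 5xx responses are bad."""
    good = sum(1 for r in requests
               if r.latency_ms < threshold_ms and r.status < 500)
    return good / len(requests)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total unavailability the SLO tolerates in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(budget_fraction: float, alert_hours: float,
              window_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `alert_hours`."""
    return budget_fraction * (window_days * 24) / alert_hours


sample = [Request(200, 120.0), Request(200, 450.0),
          Request(503, 5.0), Request(404, 80.0)]
print(availability_sli(sample))      # 3 of 4 requests are non-5xx
print(latency_sli(sample))           # 2 of 4 are both fast and non-5xx
print(error_budget_minutes(0.999))   # budget for a 99.9% SLO over 30 days
print(burn_rate(0.05, 1), burn_rate(0.10, 6))  # fast and slow thresholds
```

A burn rate of N means errors are arriving at N times the rate the SLO allows, so for a 99.9% SLO the alert fires when the error ratio over the alert window exceeds N × 0.1%.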
Deliverables:
- One-page SLO document (can be markdown).
- The exact Prometheus/CloudWatch/Datadog-style expression for each SLI if possible.
- Justification: why this latency threshold and not one twice as tight or loose?
Repeat until: you can produce the doc in under 30 minutes and defend each number.
Kata 2: Design a Rate Limiter
Time limit: 45 minutes per run
Goal: internalize admission control and load management
Setup: design a rate limiter for a public-facing API. Requirements:
- `1000 req/s` per authenticated user, `100 req/s` per unauthenticated IP.
- Burst allowance: users can spend up to `5000 requests` in a short burst as long as the long-run average stays under `1000 rps`.
- Distributed: the API runs on `20 instances` across `3 regions`. Limits must be global, not per-instance.
- Degraded-mode behavior: if the central limiter is unreachable, fail open or closed? Justify your choice.
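One way to meet the rate and burst requirements is a token bucket with a refill rate of 1000 tokens per second and a capacity of 5000. The sketch below is a single-process illustration only: it does not address the global, multi-region requirement, which needs shared state. The class name, the `cost` parameter, and the injectable `clock` are assumptions made for illustration and testing.

```python
import time
from typing import Callable


class TokenBucket:
    """Single-process token bucket; refills continuously at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity   # start full so an idle user can burst
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last call, up to capacity."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; expensive requests pass cost > 1."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens will be available."""
        self._refill()
        return max(0.0, (cost - self._tokens) / self.rate)


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1000, capacity=5000, clock=lambda: fake_now[0])
burst = sum(bucket.allow() for _ in range(6000))     # burst capped by capacity
fake_now[0] += 1.0                                   # one second of refill
refilled = sum(bucket.allow() for _ in range(2000))  # capped by refill rate
print(burst, refilled)
```

The capacity sets how large a burst can be, and the refill rate sets the long-run average, which is why a token bucket maps directly onto the two numbers in the requirements.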
Deliverables:
- Choose the algorithm: token bucket, leaky bucket, fixed window, sliding window, or hybrid. Justify.
- Draw the request path with where the check happens (gateway vs service vs both).
- Specify the storage and consistency model (Redis with Lua, in-memory + sync, per-region + federation).
- Define the exact response on rejection: status code, headers (`Retry-After`, `X-RateLimit-*`), body.
- Identify the three most likely failure modes and the mitigation for each.
- Instrument: list the metrics and SLIs for the rate limiter itself. (A broken rate limiter is a silent incident.)
Repeat until: you have done this for a per-user limiter, a per-IP limiter, and a per-resource-cost limiter (where expensive requests cost more tokens than cheap ones).
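For the rejection-response deliverable, one possible shape is sketched below. `429 Too Many Requests` and `Retry-After` are standard HTTP. The `X-RateLimit-*` header names are a common convention, not a standard, and their exact semantics vary between APIs. The function name, the body fields, and the choice to express the reset as seconds remaining are assumptions.

```python
import json
import math


def rejection_response(limit: int, remaining: int,
                       retry_after_s: float) -> tuple[int, dict[str, str], str]:
    """Build (status, headers, body) for a rate-limited request."""
    wait = max(1, math.ceil(retry_after_s))  # Retry-After takes whole seconds
    headers = {
        "Retry-After": str(wait),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        # Assumption: seconds until tokens are available, not an epoch timestamp.
        "X-RateLimit-Reset": str(wait),
        "Content-Type": "application/json",
    }
    body = json.dumps({"error": "rate_limited", "retry_after_seconds": wait})
    return 429, headers, body


status, headers, body = rejection_response(limit=1000, remaining=0,
                                           retry_after_s=0.25)
print(status, headers["Retry-After"], body)
```

Rounding the wait up rather than down avoids telling clients to retry before tokens are actually available.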
Kata 3: Apply Little's Law to a Real Queue
Time limit: 20 minutes per run
Goal: concrete intuition for the latency/throughput/concurrency identity
Setup: pick any queue or thread pool you have production data on (a web server's request queue, a background job queue, a database connection pool). Gather:
- Mean arrival rate `λ` (requests per second)
- Mean time in system `W` (queue + service time, in seconds)
- Mean items in system `L` (instantaneous queue depth + in-flight count)
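With those three measurements in hand, the consistency check in this kata is simple arithmetic. A minimal sketch follows; the numbers are invented for illustration, not taken from any real system.

```python
def littles_law_items(arrival_rate: float, time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W (mean items in system)."""
    return arrival_rate * time_in_system_s


def relative_gap(predicted: float, measured: float) -> float:
    """Disagreement between predicted and measured L, relative to measured."""
    return abs(predicted - measured) / measured


# Hypothetical measurements over the same steady-state window.
lam = 200.0        # requests per second
w = 0.150          # seconds in system (queue + service)
measured_l = 33.0  # mean queue depth + in-flight count

predicted_l = littles_law_items(lam, w)
gap = relative_gap(predicted_l, measured_l)
print(f"predicted L={predicted_l:.1f}, measured L={measured_l:.1f}, gap={gap:.1%}")

# Same W at double the traffic: concurrency must double too.
print(f"L at 2x traffic: {littles_law_items(2 * lam, w):.1f}")
```

All three quantities must come from the same window and the system must be roughly in steady state; otherwise a large gap reflects the measurement, not the queue.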
Deliverables:
- Compute `L = λ · W` from `λ` and `W` and compare to measured `L`. How close is it?
- If they disagree by more than `10%`, identify the reason (non-steady-state, batching, multiple classes of work).
- Now imagine traffic doubles (`λ -> 2λ`). Keeping `W` constant, what is the new `L`? Is that feasible with current capacity?
- Alternatively, keeping `L` capacity fixed (max concurrency), what is the new `W`? At what rate does `W` become unacceptable (pick the SLO threshold)?
- Work through a back-pressure case: at what `λ` does the queue start rejecting work?
- Sketch the utilization-latency curve using the `M/M/1` mean time in system, `W = 1 / (μ − λ)`, for `μ = 100 rps` and `λ = 50, 80, 90, 95, 99`. (Kingman's formula is the more general G/G/1 approximation; the M/M/1 expression is exact for that model.) Note the knee where latency starts climbing steeply.
Repeat until: you can eyeball any reported λ and W and immediately know whether the stated L is plausible.
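The utilization-latency curve from the last deliverable can be tabulated with a few lines of code. This sketch assumes the M/M/1 model (Poisson arrivals, exponential service times, one server), which real queues only approximate. The function name is illustrative.

```python
def mm1_time_in_system(mu: float, lam: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (mu - lam)


MU = 100.0  # service rate in requests per second
for lam in (50, 80, 90, 95, 99):
    w = mm1_time_in_system(MU, lam)
    rho = lam / MU      # utilization
    in_system = lam * w  # Little's Law: L = lambda * W
    print(f"lambda={lam:>3} rps  rho={rho:.2f}  "
          f"W={w * 1000:7.1f} ms  L={in_system:5.1f}")
```

Under this model `W` is 20 ms at `λ = 50` and 1 second at `λ = 99`: the last few percent of utilization cost far more latency than the first fifty.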
Kata 4: Run a Tabletop Incident
Time limit: 45 minutes per run
Goal: drill the incident lifecycle without production risk
Setup: pick a scenario (one of):
- Checkout p99 latency crosses SLO threshold at peak traffic; no recent deploy.
- All writes to one region are failing with 503 at 3am; reads are fine.
- Error budget is 90% consumed and it's day 10 of a 30-day window.
- A customer reports a bug you can reproduce, but the service's dashboards all show green.
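For the error-budget scenario, it helps to put numbers on the urgency before the tabletop starts. The sketch below uses a simple linear extrapolation and treats the 30 days as a fixed window. Both are simplifying assumptions, and the function names are illustrative.

```python
def average_burn_rate(consumed_fraction: float, elapsed_days: float,
                      window_days: float = 30.0) -> float:
    """Budget consumed relative to the share of the window that has elapsed."""
    return consumed_fraction / (elapsed_days / window_days)


def days_until_exhausted(consumed_fraction: float, elapsed_days: float) -> float:
    """Days of budget left if consumption continues at the average pace so far."""
    per_day = consumed_fraction / elapsed_days
    return (1.0 - consumed_fraction) / per_day


rate = average_burn_rate(0.90, 10)
left = days_until_exhausted(0.90, 10)
print(f"average burn rate: {rate:.1f}x, budget gone in about {left:.1f} days")
```

At that pace the budget is spent roughly 2.7 times faster than planned and lasts only about one more day, which frames the triage discussion around release gating rather than a single fix.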
Roles (act them out or play them yourself in writing):
- Incident Commander (IC)
- On-call engineer
- Support liaison
- Second service's on-call (downstream or upstream dependency)
Deliverables (as a written timeline with timestamps):
- Detect: what alert fired or signal arrived? How confident are you in the signal?
- Triage: severity, impact estimate, who to page. IC declares and names the channel.
- Mitigate: the first action. Justify why it is mitigation (user-facing effect stops) rather than resolution. Name the side effects.
- Continue mitigation for at least two more steps if the first doesn't fully resolve.
- Communicate: the exact text of the status update you post internally and (if any) externally.
- Resolve: the path from mitigated to root-cause fixed.
- Review: draft the postmortem's timeline, impact, causes, and three action items (specific, owned, dated).
Bonus: write the two-level burn-rate alert that would have caught this scenario 5 minutes earlier.
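As a starting point for the bonus, a two-level burn-rate check can be sketched as a pure function before translating it into your alerting system's syntax. The thresholds of 36 and 12 are the burn rates implied by Kata 1's example alerts (5% of a 30-day budget in 1 hour, 10% in 6 hours). The function names, the 99.9% SLO, and the return values are assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def classify(error_ratio_1h: float, error_ratio_6h: float, slo: float = 0.999,
             fast_threshold: float = 36.0, slow_threshold: float = 12.0) -> str:
    """Return 'page' for a fast burn, 'ticket' for a slow burn, else 'ok'."""
    if burn_rate(error_ratio_1h, slo) >= fast_threshold:
        return "page"
    if burn_rate(error_ratio_6h, slo) >= slow_threshold:
        return "ticket"
    return "ok"


print(classify(error_ratio_1h=0.05, error_ratio_6h=0.01))      # sharp spike
print(classify(error_ratio_1h=0.005, error_ratio_6h=0.015))    # slow leak
print(classify(error_ratio_1h=0.0005, error_ratio_6h=0.0005))  # within budget
```

Replaying the scenario's error ratios through a function like this shows how much earlier each threshold would have fired.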
Repeat until: you can move cleanly between phases without collapsing "mitigate" and "resolve," you never page someone who can't help, and your postmortem drafts are blameless by default.
Completion Standard
- Ran Kata 1 for at least two different APIs. Each produced a defensible SLO doc.
- Ran Kata 2 with at least two different rate-limit models. You know when to pick which.
- Ran Kata 3 for at least two different queues and can reason from any two of `L, λ, W` to the third.
- Ran Kata 4 for at least two scenarios. You can play IC in writing without conflating phases.
- You can explain each kata's core technique in one sentence.