Scale, Reliability, and Performance Katas

Focused, repeatable exercises to build fluency with this module's core skills. Complete each kata at least twice, ideally against different services or systems.

Kata 1: Write an SLI/SLO for a Real API

Time limit: 30 minutes per run
Goal: produce production-grade reliability targets

Setup: pick one real API you have access to (internal service, personal project, or a public one like a weather or news API). For it, produce:

  1. Three SLIs in strict good_events / total_events form:
    • Availability SLI: e.g., count(status < 500) / count(all requests) over 1m buckets.
    • Latency SLI: e.g., count(latency < 300ms AND status < 500) / count(all requests). Include the status filter; without it, fast failures count as good events and the SLI looks healthy while the service is failing.
    • Quality or freshness SLI specific to the API (cache hit rate threshold, staleness bound, correctness sample).
  2. An SLO for each SLI with a measurement window (rolling 28 or 30 days).
  3. Error budget in minutes (availability) and percentage (latency).
  4. Two burn-rate alerts: a fast alert (5% of monthly budget in 1h -> page) and a slow alert (10% of monthly budget in 6h -> ticket).
  5. A rollback/release-gating policy: what does the team do if the budget is exhausted?
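
The budget and burn-rate arithmetic in items 3 and 4 can be sketched as follows. This is a minimal illustration, assuming a 99.9% availability SLO over a 30-day window; both numbers and all function names are hypothetical, not prescribed by the kata.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def burn_rate_threshold(budget_fraction: float, alert_hours: float,
                        window_days: int) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `alert_hours`.

    A burn rate of 1.0 spends exactly the whole budget over the full window.
    """
    window_hours = window_days * 24
    return budget_fraction * window_hours / alert_hours


def error_ratio_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio (bad events / total events) at which the alert fires."""
    return burn_rate * (1.0 - slo)


# Assumed example values: 99.9% SLO, 30-day window.
SLO, WINDOW_DAYS = 0.999, 30
fast = burn_rate_threshold(0.05, 1, WINDOW_DAYS)   # 5% of budget in 1h
slow = burn_rate_threshold(0.10, 6, WINDOW_DAYS)   # 10% of budget in 6h

print(f"budget: {error_budget_minutes(SLO, WINDOW_DAYS):.1f} min")
print(f"fast alert: burn rate {fast:.0f}, "
      f"error ratio {error_ratio_threshold(SLO, fast):.3%}")
print(f"slow alert: burn rate {slow:.0f}, "
      f"error ratio {error_ratio_threshold(SLO, slow):.3%}")
```

Under these assumptions the budget is about 43.2 minutes, the fast alert corresponds to a burn rate of 36, and the slow alert to a burn rate of 12. Rerun the numbers with your own SLO and window before defending them.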

Deliverables:

  • One-page SLO document (can be markdown).
  • The exact Prometheus/CloudWatch/Datadog-style expression for each SLI if possible.
  • Justification: why this latency threshold and not one twice as tight or loose?
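
Before writing the monitoring-system expression, it can help to check the SLI definitions against a handful of sample records. The sketch below assumes a hypothetical `Request` record with `status` and `latency_ms` fields and shows why the latency SLI needs the status filter: a burst of fast 503s looks perfect to the unfiltered version.

```python
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    status: int        # HTTP status code
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests: Sequence[Request]) -> float:
    """good_events / total_events, where good means status < 500."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 300.0) -> float:
    """Good means fast AND not a server error."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests
               if r.latency_ms < threshold_ms and r.status < 500)
    return good / len(requests)


def latency_sli_unfiltered(requests: Sequence[Request],
                           threshold_ms: float = 300.0) -> float:
    """The mistaken version: counts fast failures as good events."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


# Made-up sample: 90 healthy requests plus 10 that fail fast with 503.
sample = [Request(200, 120.0)] * 90 + [Request(503, 5.0)] * 10

print("availability:        ", availability_sli(sample))
print("latency (filtered):  ", latency_sli(sample))
print("latency (unfiltered):", latency_sli_unfiltered(sample))
```

With this sample the filtered latency SLI matches availability at 0.9, while the unfiltered one reports 1.0 during a partial outage.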

Repeat until: you can produce the doc in under 30 minutes and defend each number.

Kata 2: Design a Rate Limiter

Time limit: 45 minutes per run
Goal: internalize admission control and load management

Setup: design a rate limiter for a public-facing API. Requirements:

  • 1000 req/s per authenticated user, 100 req/s per unauthenticated IP.
  • Burst allowance: a user can spend up to 5000 requests in a short burst as long as the long-run average stays under 1000 req/s.
  • Distributed: the API runs on 20 instances across 3 regions. Limits must be global, not per-instance.
  • Degraded-mode behavior: if the central limiter is unreachable, does the API fail open or fail closed? Justify the choice.
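
One way to express the per-user rate and burst requirements is a token bucket with a refill rate of 1000 tokens/s and a capacity of 5000. The sketch below is a single-process, in-memory illustration only: the class and method names are hypothetical, and it does not address the global (cross-instance) requirement, which needs shared state. The `cost` parameter also covers the per-resource-cost variant, where expensive requests take more tokens.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket; not thread-safe and not distributed."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock         # injectable for tests
        self._tokens = capacity     # start full so an initial burst is allowed
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed = max(0.0, now - self._last)
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise reject."""
        self._refill()
        if cost <= self._tokens:
            self._tokens -= cost
            return True
        return False

    def retry_after(self, cost: float = 1.0) -> float:
        """Seconds until `cost` tokens would be available."""
        self._refill()
        return max(0.0, (cost - self._tokens) / self.rate)


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
bucket = TokenBucket(rate=1000, capacity=5000, clock=lambda: now[0])
burst_ok = all(bucket.allow() for _ in range(5000))  # full burst is admitted
rejected = not bucket.allow()                        # request 5001 is rejected
now[0] = 0.5                                         # half a second later
refilled = bucket.allow(cost=500)                    # 500 tokens have refilled
print(burst_ok, rejected, refilled)
```

A token bucket is only one candidate; the deliverables still ask you to justify it against leaky bucket and window-based designs.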

Deliverables:

  1. Choose the algorithm: token bucket, leaky bucket, fixed window, sliding window, or hybrid. Justify.
  2. Draw the request path with where the check happens (gateway vs service vs both).
  3. Specify the storage and consistency model (Redis with Lua, in-memory + sync, per-region + federation).
  4. Define the exact response on rejection: status code, headers (Retry-After, X-RateLimit-*), body.
  5. Identify the three most likely failure modes and the mitigation for each.
  6. Instrument: list the metrics and SLIs for the rate limiter itself. (A broken rate limiter is a silent incident.)
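
For deliverable 4, a rejection response might look like the sketch below. Status 429 (Too Many Requests) and `Retry-After` are standard HTTP; the `X-RateLimit-*` header names are a common convention rather than a standard, and the meaning of `X-RateLimit-Reset` varies between APIs. Here it is assumed to be seconds until the limit resets. The function name and body shape are hypothetical.

```python
import json
import math


def rejection_response(limit: int, remaining: int, retry_after_s: float) -> dict:
    """Build a framework-agnostic description of a rate-limit rejection."""
    # Retry-After takes whole seconds; round up so clients never retry early.
    retry_after = max(1, math.ceil(retry_after_s))
    headers = {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(retry_after),  # assumed: delta seconds
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "retry_after_seconds": retry_after,
    })
    return {"status": 429, "headers": headers, "body": body}


response = rejection_response(limit=1000, remaining=0, retry_after_s=0.25)
print(response["status"], response["headers"]["Retry-After"])
```

Whatever shape you choose, document it: clients can only back off correctly if the headers mean the same thing on every endpoint.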

Repeat until: you have done this for a per-user limiter, a per-IP limiter, and a per-resource-cost limiter (where expensive requests cost more tokens than cheap ones).

Kata 3: Apply Little's Law to a Real Queue

Time limit: 20 minutes per run
Goal: concrete intuition for the latency/throughput/concurrency identity

Setup: pick any queue or thread pool you have production data on (a web server's request queue, a background job queue, a database connection pool). Gather:

  • Mean arrival rate λ (requests per second)
  • Mean time in system W (queue + service time, in seconds)
  • Mean items in system L (queue depth + in-flight count, averaged over the same window as λ and W, not a single instantaneous reading)

Deliverables:

  1. Compute L = λ · W from λ and W and compare to measured L. How close is it?
  2. If they disagree by more than 10%, identify the reason (non-steady-state, batching, multiple classes of work).
  3. Now imagine traffic doubles (λ -> 2λ). Keeping W constant, what is the new L? Is that feasible with current capacity?
  4. Alternatively, hold maximum concurrency (the cap on L) fixed. What is the new W? At what arrival rate does W become unacceptable (pick the SLO threshold)?
  5. Find the back-pressure point: at what λ does the queue fill up and start rejecting work?
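  6. Sketch the utilization-latency curve using the M/M/1 mean time in system, W = 1 / (μ − λ), for μ = 100 rps and λ = 50, 80, 90, 95, 99. Note the knee where latency starts to climb steeply. (Kingman's formula is the G/G/1 generalization for non-exponential arrivals or service times.)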
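
The calculations in the deliverables can be sketched in a few lines. The example arrival rate and time in system (200 rps, 50 ms) and the measured L of 11.5 are made-up values for illustration; the M/M/1 table uses the μ and λ values from item 6.

```python
def littles_l(arrival_rate: float, time_in_system: float) -> float:
    """Little's Law: L = lambda * W (steady state)."""
    return arrival_rate * time_in_system


def relative_gap(predicted: float, measured: float) -> float:
    """Fractional disagreement between predicted and measured L."""
    return abs(predicted - measured) / measured


def mm1_time_in_system(mu: float, lam: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (mu - lam)


# Items 1-3 with assumed measurements.
lam, w, measured_l = 200.0, 0.05, 11.5
predicted_l = littles_l(lam, w)
print(f"predicted L={predicted_l:.1f}, measured L={measured_l}, "
      f"gap={relative_gap(predicted_l, measured_l):.0%}")
print(f"L if traffic doubles at constant W: {littles_l(2 * lam, w):.1f}")

# Item 6: utilization-latency table for mu = 100 rps.
MU = 100.0
for lam_i in (50, 80, 90, 95, 99):
    w_i = mm1_time_in_system(MU, lam_i)
    print(f"lambda={lam_i:>3}  utilization={lam_i / MU:.2f}  "
          f"W={w_i * 1000:7.1f} ms  L={littles_l(lam_i, w_i):5.1f}")
```

For μ = 100 rps, W works out to 20 ms, 50 ms, 100 ms, 200 ms, and 1000 ms as λ goes from 50 to 99: the last 9% of utilization costs ten times the latency of the first 90%.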

Repeat until: you can eyeball any reported λ and W and immediately know whether the stated L is plausible.

Kata 4: Run a Tabletop Incident

Time limit: 45 minutes per run
Goal: drill the incident lifecycle without production risk

Setup: pick a scenario (one of):

  • Checkout p99 latency crosses SLO threshold at peak traffic; no recent deploy.
  • All writes to one region are failing with 503 at 3am; reads are fine.
  • Error budget is 90% consumed and it's day 10 of a 30-day window.
  • A customer reports a bug you can reproduce, but the service's dashboards all show green.
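
For the error-budget scenario, a quick calculation shows how urgent the situation is. This sketch treats the 30-day window as fixed and assumes the budget keeps burning at the average rate seen so far; both are simplifying assumptions, and the function names are hypothetical.

```python
def average_burn_rate(consumed_fraction: float, elapsed_days: float,
                      window_days: float) -> float:
    """Budget consumed relative to an even spend (1.0 = exactly on pace)."""
    return consumed_fraction / (elapsed_days / window_days)


def days_until_exhausted(consumed_fraction: float, elapsed_days: float) -> float:
    """Days of budget left if the average burn so far continues."""
    per_day = consumed_fraction / elapsed_days
    return (1.0 - consumed_fraction) / per_day


rate = average_burn_rate(0.90, elapsed_days=10, window_days=30)
left = days_until_exhausted(0.90, elapsed_days=10)
print(f"average burn rate: {rate:.1f}x, budget gone in about {left:.1f} days")
```

At 90% consumed on day 10, the average burn rate is 2.7 times the sustainable pace, and the remaining budget lasts roughly one more day at that pace.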

Roles (act them out or play them yourself in writing):

  • Incident Commander (IC)
  • On-call engineer
  • Support liaison
  • Second service's on-call (downstream or upstream dependency)

Deliverables (as a written timeline with timestamps):

  1. Detect: what alert fired or signal arrived? How confident are you in the signal?
  2. Triage: severity, impact estimate, who to page. IC declares and names the channel.
  3. Mitigate: the first action. Justify why it is mitigation (user-facing effect stops) rather than resolution. Name the side effects.
  4. Continue mitigation for at least two more steps if the first doesn't fully resolve.
  5. Communicate: the exact text of the status update you post internally and (if any) externally.
  6. Resolve: the path from mitigated to root-cause fixed.
  7. Review: draft the postmortem's timeline, impact, causes, and three action items (specific, owned, dated).

Bonus: write the two-level burn-rate alert that would have caught this scenario 5 minutes earlier.
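
A two-level burn-rate check can be sketched as a small decision function. The thresholds below (36 and 12) are an assumption: they follow from Kata 1's example alerts (5% of budget in 1h, 10% in 6h) on a 30-day window. The function names and return labels are hypothetical; a real alert would live in your monitoring system's rule language.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_ratio / (1.0 - slo)


def classify(error_ratio_1h: float, error_ratio_6h: float, slo: float,
             fast_threshold: float = 36.0,
             slow_threshold: float = 12.0) -> str:
    """Return 'page', 'ticket', or 'ok' from two window measurements."""
    if burn_rate(error_ratio_1h, slo) >= fast_threshold:
        return "page"      # fast burn: wake someone up
    if burn_rate(error_ratio_6h, slo) >= slow_threshold:
        return "ticket"    # slow burn: handle in working hours
    return "ok"


# Made-up measurements against an assumed 99.9% SLO.
print(classify(error_ratio_1h=0.05, error_ratio_6h=0.01, slo=0.999))
print(classify(error_ratio_1h=0.02, error_ratio_6h=0.015, slo=0.999))
print(classify(error_ratio_1h=0.001, error_ratio_6h=0.001, slo=0.999))
```

To answer the bonus question, replay your tabletop timeline through the function and check at which minute the first non-"ok" result appears.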

Repeat until: you can move cleanly between phases without collapsing "mitigate" and "resolve," you never page someone who can't help, and your postmortem drafts are blameless by default.

Completion Standard

  • Ran Kata 1 for at least two different APIs. Each produced a defensible SLO doc.
  • Ran Kata 2 with at least two different rate-limit models. You know when to pick which.
  • Ran Kata 3 for at least two different queues and can reason from any two of L, λ, W to the third.
  • Ran Kata 4 for at least two scenarios. You can play IC in writing without conflating phases.
  • You can explain each kata's core technique in one sentence.