Distributed Systems Code Katas

Focused, repeatable exercises to build fluency with the core distributed-systems patterns. Complete each kata at least twice, ideally in different languages or against different coordination services.

Kata 1: Simulate Lamport and Vector Clocks on a Small System

Time limit: 20 minutes per run
Goal: concrete intuition for logical time

Setup: implement a simulator in any language. Represent processes as objects with:

  • a local Lamport counter (integer)
  • a local vector clock (array of length N)
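  • a method local_event() that increments the Lamport counter and the process's own component of the vector clock
  • a method send(receiver) that increments both clocks as for a local event, tags the message with both timestamps, and delivers it (optionally with random delay)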
  • a method receive(msg) that updates both according to the rules
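
As a starting point, the process object described above can be sketched as follows. This is a minimal sketch, not a reference solution: the names Message, pid, and n are assumptions made for illustration, and send() here returns the stamped message instead of delivering it, so the simulator loop can queue and delay it.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    lamport: int
    vector: list[int]


@dataclass
class Process:
    pid: int                  # 0-based index of this process in the vector
    n: int                    # total number of processes
    lamport: int = 0
    vector: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.vector:
            self.vector = [0] * self.n

    def local_event(self) -> None:
        # Tick the Lamport counter and only this process's vector component.
        self.lamport += 1
        self.vector[self.pid] += 1

    def send(self) -> Message:
        # A send is itself an event: tick first, then stamp a copy of the clocks.
        self.local_event()
        return Message(self.pid, self.lamport, list(self.vector))

    def receive(self, msg: Message) -> None:
        # Lamport rule: max of local and received, plus one.
        self.lamport = max(self.lamport, msg.lamport) + 1
        # Vector rule: componentwise max, then tick own component.
        self.vector = [max(a, b) for a, b in zip(self.vector, msg.vector)]
        self.vector[self.pid] += 1
```

Copying the vector in send() matters: without the copy, later ticks on the sender would silently change the timestamp of a message still in the queue.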

Run this script:

  1. 3 processes P1, P2, P3.
  2. Random interleaving: each tick, a random process either does a local event, sends to a random peer, or delivers one queued message.
  3. After 50 ticks, print every event's Lamport timestamp, vector timestamp, and process id.
  4. List every concurrent pair (detected via vector clocks).
  5. Verify: for every a -> b in the trace, L(a) < L(b) and V(a) < V(b), where V(a) < V(b) means V(a) <= V(b) in every component and strictly less in at least one.
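
Steps 4 and 5 both reduce to comparing two vector timestamps. A possible sketch of that comparison is shown below; the function names and the (label, vector) event representation are assumptions, not part of the kata.

```python
from itertools import combinations


def happened_before(va, vb):
    """True if va < vb: every component <=, and at least one strictly <."""
    pairs = list(zip(va, vb))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def concurrent(va, vb):
    """True if the two (distinct) timestamps are not ordered either way."""
    return va != vb and not happened_before(va, vb) and not happened_before(vb, va)


def concurrent_pairs(events):
    """events: list of (label, vector) tuples; returns label pairs that are concurrent."""
    return [
        (x, y)
        for (x, vx), (y, vy) in combinations(events, 2)
        if concurrent(vx, vy)
    ]
```

Note that Lamport timestamps alone cannot drive step 4: L(a) < L(b) does not imply a -> b, which is why the concurrency check uses the vectors.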

Repeat until: you can write the update rules without looking them up and your verification never fails.

Kata 2: Implement a Toy Raft Leader Election

Time limit: 45-60 minutes per run
Goal: internalize the election protocol and its invariants

Setup: implement, in any language:

  • Node objects with id, currentTerm, votedFor, state (Follower/Candidate/Leader), log (may be empty for this kata).
  • RequestVote(term, candidateId, lastLogIndex, lastLogTerm) RPC.
  • A randomized election timeout per node (150-300ms).
  • A heartbeat from the leader every 50ms (empty AppendEntries).

Behavior requirements:

  1. A Follower with an expired election timer becomes a Candidate, increments currentTerm, votes for itself, sends RequestVote to all peers.
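  2. On RequestVote receipt: if term < currentTerm, reject. If term > currentTerm, adopt the new term and clear votedFor first. Then, if you haven't voted this term (or already voted for this same candidate) and the candidate's log is at least as up-to-date as yours (compare the last entry's term, then its index), grant the vote.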
  3. On winning a majority, become Leader and start heartbeats.
  4. On seeing a higher term, revert to Follower.
  5. Test: inject a leader crash by stopping its heartbeats. Verify exactly one node becomes the new leader within a bounded time.
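
The vote-granting rule in requirement 2 is the easiest part to get subtly wrong, so here is one way it might look. This is a sketch under assumptions: the log is a list of (term, command) tuples with 1-based indices, an empty log reports (0, 0), and the handler returns a (currentTerm, voteGranted) pair. Timers, RPC transport, and persistence are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    id: int
    current_term: int = 0
    voted_for: Optional[int] = None
    state: str = "Follower"
    log: list = field(default_factory=list)   # assumed shape: [(term, command), ...]

    def last_log(self) -> tuple[int, int]:
        """Return (lastLogTerm, lastLogIndex); (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def on_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        """Handle a RequestVote RPC; returns (currentTerm, voteGranted)."""
        if term < self.current_term:
            return (self.current_term, False)
        if term > self.current_term:
            # Higher term seen: adopt it, step down, forget any earlier vote.
            self.current_term = term
            self.voted_for = None
            self.state = "Follower"
        can_vote = self.voted_for in (None, candidate_id)
        # Tuple comparison orders by term first, then by index.
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if can_vote and up_to_date:
            self.voted_for = candidate_id
            return (self.current_term, True)
        return (self.current_term, False)
```

Granting a repeated request from the same candidate keeps the handler safe under retried RPCs, while the votedFor check is what prevents two leaders in one term.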

Bonus: run a 5-node cluster and simulate a network partition that splits it into groups of 2 and 3; verify that the 3-node side elects a leader and the 2-node side does not.
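
The expected outcome of the bonus follows directly from the majority rule, which can be written as a one-line check. The helper name has_quorum is an assumption for illustration.

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """True if `votes` is a strict majority of the full cluster size."""
    return votes > cluster_size // 2
```

The quorum is computed against the configured cluster size (5), not the number of reachable peers, so a candidate on the 2-node side can collect at most 2 votes and never wins.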

Repeat until: you can write the state machine from scratch and your test never sees two leaders in the same term.

Kata 3: Design an Idempotent HTTP API

Time limit: 30 minutes per run
Goal: apply the idempotency pattern end-to-end

Setup: sketch (on paper or in code) a POST /payments endpoint with the following behavior:

  1. Accepts Idempotency-Key: <uuid> header.
  2. Body: { "amount": N, "currency": "USD", "card_token": "..." }.
  3. On first call with a given key: process the charge, store (key, request_hash, response) with 24h TTL, return response.
  4. On duplicate call with same key and same request body: return the stored response without reprocessing.
  5. On duplicate call with same key but different body: return 422 Unprocessable Entity (key reuse is a client bug).
  6. On duplicate call where the first is still in flight: either return 409 Conflict or block until the first finishes.

Deliverables:

  1. The handler pseudo-code.
  2. The schema of the idempotency store (key, body hash, response, status, created_at).
  3. A failure-mode table: what happens if the store is unavailable? If the downstream charge API returns but our response is lost?
  4. An argument for why this guarantees effectively-once processing despite at-least-once delivery.
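
For deliverable 2, one way to make the schema concrete is to lean on a uniqueness constraint so that claiming a key is atomic. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for whatever store you choose; the table name, column names, and the claim_key helper are all assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS idempotency_keys (
    idempotency_key TEXT PRIMARY KEY,
    body_hash       TEXT NOT NULL,
    status          TEXT NOT NULL CHECK (status IN ('in_flight', 'done')),
    response        TEXT,
    created_at      REAL NOT NULL
);
"""


def claim_key(conn, key, body_hash, now):
    """Try to claim a key atomically; True only for the first caller."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_keys "
                "(idempotency_key, body_hash, status, created_at) "
                "VALUES (?, ?, 'in_flight', ?)",
                (key, body_hash, now),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key conflict: another request already holds this key.
        return False
```

The primary key does the work here: two concurrent first calls cannot both insert, so exactly one proceeds to the charge and the other falls through to the duplicate-handling branches.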

Repeat until: you cover at least four failure modes correctly and the handler is under 30 lines.

Kata 4: Analyze a Real Distributed Outage Postmortem

Time limit: 60 minutes per postmortem
Goal: recognize module concepts in the wild

Setup: pick one public postmortem from the list below (or equivalent). Read it once, then write a 1-2 page analysis answering each question.

Recommended postmortems:

  • Cloudflare October 2023 dashboard outage
  • AWS us-east-1 November 2020 Kinesis outage and/or the February 2017 S3 outage
  • GitHub October 21, 2018 major service interruption (network partition + data inconsistency)
  • GitLab January 31, 2017 database deletion incident
  • Stripe July 2019 outage (database cluster unable to elect a new primary)

Questions:

  1. Which fallacy of distributed computing did the system implicitly rely on? Quote the line from the postmortem.
  2. What failure model did the system's design assume? Was that assumption violated?
  3. Was time or clocks involved? How?
  4. Was a failure detector involved? Was it too aggressive, too slow, or correct-but-unhelpful?
  5. Was consensus or leader election involved? Did it behave as designed?
  6. If you were writing the runbook to prevent recurrence, which of the 15 concepts in this module would you cite?

Repeat until: you can spot a distributed-systems concept in any outage narrative without rereading the module.

Completion Standard

  • Ran Kata 1 at least twice and can reproduce the clock rules from memory.
  • Ran Kata 2 at least twice and your election never produced two leaders in the same term.
  • Wrote Kata 3 in under 30 minutes with at least four failure modes covered.
  • Analyzed at least two postmortems in Kata 4.
  • You can explain each kata's core technique in one sentence.