Distributed Systems Code Katas
Focused, repeatable exercises for building fluency with the core distributed-systems patterns. Complete each kata at least twice, ideally in different languages or against different coordination services.
Kata 1: Simulate Lamport and Vector Clocks on a Small System
Time limit: 20 minutes per run
Goal: concrete intuition for logical time
Setup: implement a simulator in any language. Represent processes as objects with:
- a local Lamport counter (integer)
- a local vector clock (array of length N)
- a method local_event() that increments both
- a method send(receiver) that increments both, tags the message with both timestamps, and delivers it (optionally with random delay)
- a method receive(msg) that updates both according to the rules
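As a starting point, the process object described above could be sketched as follows. This is a minimal sketch, not a reference solution: the Message type and the field names are assumptions made for illustration, and send() returns the stamped message instead of delivering it, so the queueing and random delay are left to the simulator loop.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    lamport: int
    vector: list


@dataclass
class Process:
    pid: int                                  # index of this process, 0..n-1
    n: int                                    # total number of processes
    lamport: int = 0
    vector: list = field(default_factory=list)

    def __post_init__(self):
        if not self.vector:
            self.vector = [0] * self.n

    def local_event(self):
        # Both clocks tick on every local event.
        self.lamport += 1
        self.vector[self.pid] += 1

    def send(self):
        # A send is itself an event: tick, then stamp the message with copies.
        self.local_event()
        return Message(self.pid, self.lamport, list(self.vector))

    def receive(self, msg):
        # Lamport rule: move past the sender's timestamp.
        self.lamport = max(self.lamport, msg.lamport) + 1
        # Vector rule: componentwise max, then tick our own entry.
        self.vector = [max(a, b) for a, b in zip(self.vector, msg.vector)]
        self.vector[self.pid] += 1


p1, p2 = Process(0, 2), Process(1, 2)
p1.local_event()      # p1: lamport=1, vector=[1, 0]
m = p1.send()         # p1: lamport=2, vector=[2, 0]
p2.receive(m)         # p2: lamport=3, vector=[2, 1]
```

Copying the vector in send() matters: if the message shared the sender's list, later events on the sender would silently rewrite timestamps of messages still in flight.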
Run this script:
- 3 processes P1, P2, P3.
- Random interleaving: each tick, a random process either does a local event, sends to a random peer, or delivers one queued message.
- After 50 ticks, print every event's Lamport timestamp, vector timestamp, and process id.
- List every concurrent pair (detected via vector clocks).
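- Verify: for every a -> b in the trace, L(a) < L(b) and V(a) < V(b), where V(a) < V(b) means every component of V(a) is less than or equal to the matching component of V(b) and at least one is strictly less.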
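The concurrency and verification checks reduce to one comparison on vector timestamps. The sketch below is one way to write it; the function names and the (label, vector) event representation are assumptions made for illustration. Note that Lamport timestamps alone cannot detect concurrency: L(a) < L(b) holds for every a -> b, but it can also hold for concurrent events.

```python
from itertools import combinations


def happened_before(va, vb):
    """True if va -> vb: every component <=, and at least one strictly <."""
    pairs = list(zip(va, vb))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def concurrent(va, vb):
    """Two distinct events are concurrent when neither happened before the other."""
    return va != vb and not happened_before(va, vb) and not happened_before(vb, va)


def concurrent_pairs(events):
    """events: list of (label, vector) tuples. Returns the concurrent label pairs."""
    return [(x, y) for (x, vx), (y, vy) in combinations(events, 2)
            if concurrent(vx, vy)]


events = [("a", [1, 0, 0]), ("b", [2, 0, 0]), ("c", [0, 1, 0]), ("d", [2, 2, 0])]
print(concurrent_pairs(events))   # [('a', 'c'), ('b', 'c')]
```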
Repeat until: you can write the update rules without looking them up and your verification never fails.
Kata 2: Implement a Toy Raft Leader Election
Time limit: 45-60 minutes per run
Goal: internalize the election protocol and its invariants
Setup: implement, in any language:
- Node objects with id, currentTerm, votedFor, state (Follower/Candidate/Leader), log (may be empty for this kata).
- RequestVote(term, candidateId, lastLogIndex, lastLogTerm) RPC.
- A randomized election timeout per node (150-300ms).
- A heartbeat from the leader every 50ms (empty AppendEntries).
Behavior requirements:
- A Follower with an expired election timer becomes a Candidate, increments currentTerm, votes for itself, and sends RequestVote to all peers.
- On RequestVote receipt: if term < currentTerm, reject. If you haven't voted this term (or already voted for this same candidate) and the candidate's log is at least as up-to-date as yours (compare the last entry's term, then its index), grant the vote.
- On winning a majority, become Leader and start heartbeats.
- On seeing a higher term, revert to Follower.
- Test: inject a leader crash by stopping its heartbeats. Verify exactly one node becomes the new leader within a bounded time.
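The vote-granting rule is the part most often gotten wrong, so here is a minimal sketch of the receiver side only. It assumes log entries are (term, command) tuples and 1-based log indices, and it omits timers, networking, and the election-timer reset that normally accompanies a granted vote; the method names on_request_vote and last_log are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    id: int
    currentTerm: int = 0
    votedFor: Optional[int] = None
    state: str = "Follower"
    log: list = field(default_factory=list)   # assumed: (term, command) tuples

    def last_log(self):
        # (lastLogTerm, lastLogIndex); index is 1-based, (0, 0) for an empty log.
        return (self.log[-1][0], len(self.log)) if self.log else (0, 0)

    def on_request_vote(self, term, candidateId, lastLogIndex, lastLogTerm):
        """Returns (currentTerm, voteGranted)."""
        if term < self.currentTerm:
            return self.currentTerm, False
        if term > self.currentTerm:
            # Higher term seen: adopt it, forget the old vote, step down.
            self.currentTerm = term
            self.votedFor = None
            self.state = "Follower"
        # Compare last entry's term first, then index (tuple ordering does both).
        up_to_date = (lastLogTerm, lastLogIndex) >= self.last_log()
        if self.votedFor in (None, candidateId) and up_to_date:
            self.votedFor = candidateId
            return self.currentTerm, True
        return self.currentTerm, False


n = Node(id=1, currentTerm=2)
print(n.on_request_vote(1, 2, 0, 0))   # stale term: (2, False)
print(n.on_request_vote(3, 2, 0, 0))   # new term, empty logs: (3, True)
print(n.on_request_vote(3, 3, 0, 0))   # already voted for node 2: (3, False)
```

Because votedFor is only cleared when the term advances, a node grants at most one distinct candidate per term, which is what rules out two leaders in the same term.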
Bonus: simulate a network partition that splits a 5-node cluster 2-3; verify the 3-side elects a leader and the 2-side does not.
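The expected outcome of the partition follows from quorum arithmetic, which can be turned into a test oracle. A small sketch, assuming a 5-node cluster and hypothetical helper names:

```python
def majority(cluster_size):
    # Smallest number of votes that is more than half of the full cluster.
    return cluster_size // 2 + 1


def can_elect(side_size, cluster_size):
    # A partition side can elect a leader only if it holds a full-cluster majority.
    return side_size >= majority(cluster_size)


print(majority(5))        # 3
print(can_elect(3, 5))    # True
print(can_elect(2, 5))    # False
```

The majority is computed against the full cluster size, not the number of reachable peers; counting only reachable nodes is a common bug that lets both sides elect a leader.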
Repeat until: you can write the state machine from scratch and your test never sees two leaders in the same term.
Kata 3: Design an Idempotent HTTP API
Time limit: 30 minutes per run
Goal: apply the idempotency pattern end-to-end
Setup: sketch (on paper or in code) a POST /payments endpoint with the following behavior:
- Accepts Idempotency-Key: <uuid> header.
- Body: { "amount": N, "currency": "USD", "card_token": "..." }.
- On first call with a given key: process the charge, store (key, request_hash, response) with 24h TTL, return response.
- On duplicate call with same key and same request body: return the stored response without reprocessing.
- On duplicate call with same key but different body: return 422 Unprocessable Entity (key reuse is a client bug).
- On duplicate call where the first is still in flight: either return 409 Conflict or block until the first finishes.
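One possible shape for the handler, as a single-process sketch of the behavior listed above. Everything here is an assumption made for illustration: the in-memory dict stands in for a real store, charge is a caller-supplied stand-in for the downstream API, and the 409 branch is chosen over blocking. In a real deployment the "claim the key" step must be an atomic insert-if-absent, and a charge that raises would leave the record stuck in_flight until the TTL expires.

```python
import hashlib
import json
import time

TTL_SECONDS = 24 * 60 * 60
store = {}  # key -> {"request_hash", "status", "response", "created_at"}


def request_hash(body):
    # Canonical JSON so that key order does not change the hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle_post_payment(idempotency_key, body, charge, now=time.time):
    """Returns (http_status, response_body). `charge` is the downstream call."""
    h = request_hash(body)
    rec = store.get(idempotency_key)
    if rec and now() - rec["created_at"] > TTL_SECONDS:
        rec = None  # expired: treat the key as new
    if rec:
        if rec["request_hash"] != h:
            return 422, {"error": "idempotency key reused with a different body"}
        if rec["status"] == "in_flight":
            return 409, {"error": "original request still in progress"}
        return rec["response"]  # replay the stored response, no second charge
    # Claim the key before charging so a concurrent duplicate sees in_flight.
    store[idempotency_key] = {"request_hash": h, "status": "in_flight",
                              "response": None, "created_at": now()}
    response = (201, charge(body))
    store[idempotency_key].update(status="done", response=response)
    return response
```

Storing the record before calling charge, rather than after, is the design choice that makes the in-flight case detectable at all; the cost is that a crash between the claim and the charge needs a recovery path, which belongs in the failure-mode table.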
Deliverables:
- The handler pseudo-code.
- The schema of the idempotency store (key, body hash, response, status, created_at).
- A failure-mode table: what happens if the store is unavailable? If the downstream charge API returns but our response is lost?
- An argument for why this guarantees effectively-once processing despite at-least-once delivery.
Repeat until: you cover at least four failure modes correctly and the handler is under 30 lines.
Kata 4: Analyze a Real Distributed Outage Postmortem
Time limit: 60 minutes per postmortem
Goal: recognize module concepts in the wild
Setup: pick one public postmortem from the list below (or equivalent). Read it once, then write a 1-2 page analysis answering each question.
Recommended postmortems:
- Cloudflare October 2023 dashboard outage
- AWS us-east-1 outages: November 2020 (the "Kinesis" one), December 2021 (internal network congestion), and/or the February 2017 S3 outage
- GitHub October 21, 2018 major service interruption (network partition + data inconsistency)
- GitLab January 31, 2017 database deletion incident
- Stripe July 2019 outage (database cluster leader-election failure)
Questions:
- Which fallacy of distributed computing was violated? Quote the line from the postmortem.
- What failure model did the system's design assume? Was that assumption violated?
- Was time or clocks involved? How?
- Was a failure detector involved? Was it too aggressive, too slow, or correct-but-unhelpful?
- Was consensus or leader election involved? Did it behave as designed?
- If you were writing the runbook to prevent recurrence, which of the 15 concepts in this module would you cite?
Repeat until: you can spot a distributed-systems concept in any outage narrative without rereading the module.
Completion Standard
- Ran Kata 1 at least twice and can reproduce the clock rules from memory.
- Ran Kata 2 at least once without ever producing two leaders in the same term.
- Wrote Kata 3 in under 30 minutes with at least four failure modes covered.
- Analyzed at least two postmortems in Kata 4.
- You can explain each kata's core technique in one sentence.