Code Katas

Focused, repeatable drills that build operational fluency in replication and partitioning. These are not greenfield projects; each kata is a small simulation or analysis you can finish in under an hour and repeat until automatic.

Tooling expectations: a scripting language you are comfortable with (Python is easiest), a local PostgreSQL you can restart, and access to at least one Jepsen report. No distributed cluster required; the simulations are on paper and in process.

Kata 1: Simulate a Single-Leader Failover

Time limit: 45 minutes

Goal: Walk a leader failover start to finish from the perspective of the leader, the replicas, the client, and the coordinator. Identify every data-loss window and every split-brain risk.

Setup. Spin up a local PostgreSQL with one primary and two asynchronous streaming replicas (Docker Compose with postgres:16 works well). Alternatively, simulate on paper with a table of (time, leader, replica1.LSN, replica2.LSN, writes_accepted, client_visible_state).

Drill:

Start writes at 1000 TPS to the primary.
At t=10s, kill -9 the primary.
Record:
- How long until the failure was detected?
- Which replica had the higher LSN at kill time?
- If you promote that replica, how many writes are lost?
Configure Patroni (or your orchestrator) with a quorum lock in etcd. Repeat. Prove that the old primary cannot accept writes if it comes back later.
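Configure semi-synchronous replication (synchronous_standby_names = 'replica1', where replica1 is that standby's application_name). Repeat. Observe the data-loss window: for acknowledged writes it should drop to zero, provided the sync replica survives and is the one you promote.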
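The paper version of this drill can also be run in process. The sketch below is a deliberately simplified model, not a measurement of PostgreSQL: it assumes one write advances the LSN by one unit, that each replica trails the primary by a constant lag, and that a sync standby gates acknowledgement. The class and function names (Replica, simulate_failover) and the lag figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_ms: int  # assumed constant replication lag (hypothetical figure)

    def received_lsn(self, t_ms: int, tps: int) -> int:
        # Writes this replica has received by time t_ms; one write = one LSN unit.
        return max(0, (t_ms - self.lag_ms) * tps // 1000)


def simulate_failover(kill_ms, tps, replicas, sync_standby=None):
    """Return which replica to promote and how many acknowledged writes are lost."""
    primary_lsn = kill_ms * tps // 1000
    received = {r.name: r.received_lsn(kill_ms, tps) for r in replicas}
    # Async: the primary acknowledges on local commit.
    # Sync standby: a write is acknowledged only once that standby has it.
    acked = primary_lsn if sync_standby is None else received[sync_standby]
    promote = max(received, key=received.get)
    return {
        "promote": promote,
        "acked": acked,
        "lost_acked_writes": max(0, acked - received[promote]),
    }


replicas = [Replica("replica1", lag_ms=40), Replica("replica2", lag_ms=15)]
async_run = simulate_failover(10_000, 1000, replicas)
sync_run = simulate_failover(10_000, 1000, replicas, sync_standby="replica1")
print(async_run)
print(sync_run)
```

Under these assumptions the async run loses the writes the best replica had not yet received, while the sync run loses no acknowledged writes. Varying lag_ms and the promotion choice is a quick way to rehearse the one-sentence data-loss statements before touching a real cluster.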

Repeat until: you can run the simulation without referring to notes and can name the data-loss window under each of three configurations (async, semi-sync, sync) in one sentence each.
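The "old primary cannot accept writes" step is easier to reason about with a toy model of fencing. The following is a generic fencing-token sketch, not a description of how Patroni or etcd work internally: LeaderLock stands in for a quorum lock that hands out increasing epochs, and Storage stands in for any downstream component that rejects stale epochs. Both class names are hypothetical.

```python
class LeaderLock:
    """Stand-in for a quorum lock; each acquisition gets a larger epoch."""

    def __init__(self):
        self.epoch = 0
        self.holder = None

    def acquire(self, node):
        self.epoch += 1
        self.holder = node
        return self.epoch


class Storage:
    """Rejects any write carrying an epoch older than the newest one seen."""

    def __init__(self):
        self.highest_epoch = 0
        self.rows = []

    def write(self, epoch, value):
        if epoch < self.highest_epoch:
            raise PermissionError(f"stale epoch {epoch} < {self.highest_epoch}")
        self.highest_epoch = epoch
        self.rows.append(value)


lock = LeaderLock()
storage = Storage()

old_epoch = lock.acquire("pg1")   # original primary
storage.write(old_epoch, "w1")

new_epoch = lock.acquire("pg2")   # failover: replica promoted
storage.write(new_epoch, "w2")

try:
    storage.write(old_epoch, "w3")  # old primary comes back and tries to write
    rejected = False
except PermissionError:
    rejected = True

print(rejected, storage.rows)
```

The point to carry into the real drill: proving the old primary is fenced means showing a rejected write, not just observing that nobody happened to send it one.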

Kata 2: Diagnose a Replication-Lag Read Anomaly

Time limit: 30 minutes

Goal: Take a bug report, identify the anomaly (Cluster 3.9), reproduce it locally, and prescribe the smallest fix that eliminates it.

Setup. Local Postgres with one primary and one replica. Add artificial lag via recovery_min_apply_delay = '5s' on the replica.

Drill:

Write a Python script that does: (a) INSERT a row on the primary, (b) SELECT that row from the replica. Run step (b) in a loop and observe: the replica returns an empty result for roughly the first 5 seconds after the commit.
Name the anomaly ("read-your-writes violation").
Implement three fixes, each in a separate branch of the script:
- a. Read from primary after writes.
- b. After the write, capture pg_current_wal_lsn() on the primary; pass it to the next read; the reader polls SELECT pg_last_wal_replay_lsn() >= $lsn on the replica until it returns true.
- c. Sticky-session the client to a specific replica (sticky-to-primary is the degenerate case).
For each fix, note three things: whether it fixes the bug, whether it also fixes monotonic-reads violations, and at what cost.
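Before wiring fix b to a real replica, it can help to rehearse the logic against an in-process model. This sketch assumes a simulated clock instead of real sleeps, treats the LSN as a position in the primary's log, and models the apply delay as a fixed number of seconds. Primary, DelayedReplica, and read_after_write are hypothetical names; on a real replica the replay position comes from pg_last_wal_replay_lsn().

```python
class Primary:
    def __init__(self):
        self.log = []  # list of (commit_time, row), in commit order

    def insert(self, row, now):
        self.log.append((now, row))
        return len(self.log)  # simulated LSN = position in the log


class DelayedReplica:
    def __init__(self, primary, delay):
        self.primary = primary
        self.delay = delay  # seconds, like an artificial apply delay

    def replay_lsn(self, now):
        # Entries committed at least `delay` seconds ago have been applied.
        return sum(1 for t, _ in self.primary.log if t + self.delay <= now)

    def select_all(self, now):
        return [row for _, row in self.primary.log[: self.replay_lsn(now)]]


def read_after_write(replica, lsn, now, poll=0.5, timeout=10.0):
    """Poll until the replica has replayed `lsn`, then read.

    Returns (rows, seconds_waited) on the simulated clock.
    """
    waited = 0.0
    while replica.replay_lsn(now + waited) < lsn:
        if waited >= timeout:
            raise TimeoutError("replica did not catch up in time")
        waited += poll
    return replica.select_all(now + waited), waited


primary = Primary()
replica = DelayedReplica(primary, delay=5.0)

lsn = primary.insert("order-1", now=0.0)
stale = replica.select_all(now=0.1)                  # the anomaly
rows, waited = read_after_write(replica, lsn, now=0.1)
print(stale, rows, waited)
```

The waited value is the cost column for fix b: the read is correct, but its latency is bounded below by the replica's lag.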

Repeat until: you can map each of the three guarantees (RYW, monotonic, consistent-prefix) to a concrete Postgres implementation detail from memory.
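For the monotonic-reads guarantee, one common client-side approach is a high-water mark: the session remembers the highest replay LSN it has read from and refuses any replica that is behind it. The sketch below assumes the router can see each replica's replay LSN as an integer; Session and pick_replica are hypothetical names, and the LSN values are made up.

```python
class Session:
    """Remembers the highest replay LSN this client has read from."""

    def __init__(self):
        self.seen_lsn = 0


def pick_replica(session, candidates):
    """candidates: list of (name, replay_lsn) in router preference order.

    Returns the first replica that is not behind what the session has
    already seen, or None so the caller can fall back to the primary.
    """
    for name, replay_lsn in candidates:
        if replay_lsn >= session.seen_lsn:
            session.seen_lsn = replay_lsn
            return name
    return None


session = Session()
first = pick_replica(session, [("r1", 120), ("r2", 90)])
second = pick_replica(session, [("r2", 90), ("r1", 125)])   # r2 would go back in time
third = pick_replica(session, [("r2", 95)])                 # nothing fresh enough
print(first, second, third, session.seen_lsn)
```

Sticky sessions give the same guarantee by never changing replica; the high-water mark keeps load balancing at the price of carrying state in the session.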

Kata 3: Design a Sharding Scheme for a Workload

Time limit: 45 minutes

Goal: Take a workload you have not seen before, produce a one-page sharding design memo, and defend it.

Setup. Pick a workload card at random (or use the five in practice/02). Set a timer.

Drill:

Fill in the template:
- Partition key
- Scheme (range / hash / composite)
- Replication factor
- Consistency target
- Hotspot risk
- Rebalancing plan
- Failure-domain analysis (what does losing one rack cost?)
Draw the diagram: partitions, replicas, routing tier.
Identify at least two queries that will be slow under your scheme. For each, propose one of: (a) a secondary index, (b) a denormalization, (c) a separate system.
Hand the memo to a colleague (or a rubber duck). Ask them: "What breaks first?"
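The "Hotspot risk" line of the template can be backed with a quick measurement. This sketch compares range and hash partitioning for a burst of monotonically increasing keys and reports the share of writes landing on the busiest shard. The shard count, range boundaries, and key range are made-up inputs, and hash-mod-n is used only for brevity.

```python
import bisect
import hashlib
from collections import Counter


def range_shard(key, boundaries):
    # boundaries: sorted exclusive upper bounds; the last shard takes the rest.
    return bisect.bisect_right(boundaries, key)


def hash_shard(key, n_shards):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


def hottest_share(assignments):
    """Fraction of all writes that hit the single busiest shard."""
    counts = Counter(assignments)
    return max(counts.values()) / len(assignments)


# The newest 1000 writes on a monotonically increasing ID (hypothetical workload).
recent_keys = range(9000, 10000)
boundaries = [2500, 5000, 7500]          # 4 range shards

range_hot = hottest_share([range_shard(k, boundaries) for k in recent_keys])
hash_hot = hottest_share([hash_shard(k, 4) for k in recent_keys])
print(f"range: {range_hot:.2f}  hash: {hash_hot:.2f}")
```

With range partitioning every recent write lands on the last shard, while hashing spreads them roughly evenly. Keep in mind that hash mod n remaps most keys when n changes, which belongs under "Rebalancing plan" in the memo.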

Repeat until: you can produce a credible sharding memo for an unfamiliar workload in under 30 minutes, including the slow-query identification.
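The failure-domain question in the template ("what does losing one rack cost?") can be answered mechanically once replica placement is written down. A minimal sketch, assuming majority quorums and a placement map from partition to the racks holding its replicas; rack_loss_report and the placement itself are hypothetical.

```python
def rack_loss_report(placement, lost_rack):
    """Classify each partition after losing one rack.

    placement: dict of partition -> list of racks, one entry per replica.
    """
    report = {"healthy": [], "degraded": [], "no_quorum": []}
    for partition, racks in placement.items():
        alive = sum(1 for rack in racks if rack != lost_rack)
        quorum = len(racks) // 2 + 1  # assumed majority quorum
        if alive == len(racks):
            report["healthy"].append(partition)
        elif alive >= quorum:
            report["degraded"].append(partition)
        else:
            report["no_quorum"].append(partition)
    return report


placement = {
    "p0": ["A", "B", "C"],   # one replica per rack
    "p1": ["A", "A", "B"],   # two replicas share rack A
    "p2": ["B", "C", "C"],
}
report = rack_loss_report(placement, lost_rack="A")
print(report)
```

Any partition that shows up under no_quorum for some rack is a placement bug worth calling out in the memo, regardless of the replication factor.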

Kata 4: Analyze a Jepsen-Style Report

Time limit: 60 minutes

Goal: Read a real Jepsen report (jepsen.io) end-to-end and convert the finding into this module's vocabulary.

Setup. Pick one report from https://jepsen.io/analyses. Good starter options: MongoDB 3.4.0-rc3, Cassandra, Redis Raft, YugabyteDB.

Drill:

Read the report's abstract and findings section.
Answer in writing:
- What invariant did Jepsen test? (e.g., linearizable register, causal consistency, no-lost-writes)
- What replication topology and consistency level was the system running?
- What fault was injected (network partition, clock skew, process pause)?
- What was the observed violation?
- What was the root cause?
- What did the vendor fix?
Rewrite the finding in one paragraph using the vocabulary of Clusters 1-5: topology, log format, sync mode, replication lag, quorum, fencing.
Draw the failure timeline: writes, partitions, reads, and the observed violation.
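To make the "observed violation" concrete, it can help to run a tiny checker over a hand-written history from your timeline. The sketch below checks a no-lost-writes invariant for a set workload. The history format is a simplification loosely modeled on the ok / fail / info completion types used in Jepsen histories, and check_history is a hypothetical helper, not part of Jepsen.

```python
def check_history(history, final_read):
    """Check a set workload for lost or unexpected writes.

    history: list of {"type": "ok" | "fail" | "info", "value": v}
      ok   = write acknowledged
      fail = write definitely did not happen
      info = outcome unknown (e.g. timeout during a partition)
    final_read: set of values visible after the cluster has healed.
    """
    acked = {op["value"] for op in history if op["type"] == "ok"}
    unknown = {op["value"] for op in history if op["type"] == "info"}
    lost = acked - final_read                    # acknowledged but missing
    unexpected = final_read - acked - unknown    # present but never possibly written
    return {"lost": lost, "unexpected": unexpected, "valid": not lost and not unexpected}


history = [
    {"type": "ok", "value": 1},
    {"type": "ok", "value": 2},     # acknowledged by the minority-side leader
    {"type": "info", "value": 3},   # timed out; may or may not have landed
    {"type": "fail", "value": 4},
    {"type": "ok", "value": 5},
]
result = check_history(history, final_read={1, 3, 5})
print(result)
```

Here write 2 was acknowledged and then disappeared, which is the shape of finding to restate in the module's vocabulary: which node acknowledged it, under what sync mode, and why fencing or quorum did not prevent it.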

Repeat until: you can read a new Jepsen report and, within 20 minutes, produce the paragraph-level summary in this module's language.

Completion Standard

Can run the failover simulation from memory and name the data-loss window under three replication modes.
Can reproduce a read-your-writes anomaly and apply at least two mechanically different fixes.
Can produce a sharding memo for a new workload in 30 minutes.
Can analyze a Jepsen report and produce a correct short summary in the module's vocabulary.
Have kept a mistake log of at least 5 recurring errors ("forgot to fence old leader", "assumed LWW was safe", "hash-partitioned on a monotonic key", "ignored minority side during partition", "confused conflict with stale read").
