Skip to main content

Code Katas

Focused, repeatable coding exercises to build fluency in storage-engine reasoning. Complete each kata multiple times until the setup feels automatic.

Kata 1: Simulate a B+-Tree Insert and Split

Time limit: 25 minutes
Goal: Implement an in-memory B+-tree supporting insert(key, value) and get(key) with a small order (say 4). Print the tree after every insert.
Setup: Model nodes as objects with keys[], children[] (internal) or values[] (leaf), and a linked-list pointer at leaves. Implement leaf splits and the separator promotion. Do not implement deletes.

Checklist:

  • After 100 random inserts, every leaf holds between ceil(m/2)-1 and m-1 entries (or you are the root).
  • All leaves sit at the same depth.
  • A range scan by iterating the leaf chain returns keys in sorted order.
  • Printed output clearly shows splits and separator promotion.

Repeat until: You can explain, without looking at code, when a split happens and how the separator is chosen.

Kata 2: Implement a Toy LSM Memtable + SSTable

Time limit: 45 minutes
Goal: Build a key-value store with a memtable and disk-resident sorted runs.
Setup:

  1. Memtable: a sorted dict (Python SortedDict, Rust BTreeMap, etc.) capped at N entries.
  2. On overflow, flush to a new SSTable file: write entries in sorted order as (key_len, key, value_len, value) records, plus an in-memory Bloom filter.
  3. get(key): check the memtable first, then SSTables newest-to-oldest, skipping any whose Bloom filter says No.
  4. Implement a minimal leveled-compaction step that merges two equal-sized SSTables into one.

Checklist:

  • Writes never block on disk I/O.
  • get returns the most recent value (including the case where an older SSTable has a stale version).
  • A delete/tombstone returns "not found" even if an older SSTable has the value.
  • The Bloom filter demonstrably skips at least one SSTable during a negative lookup.

Repeat until: You can explain, out loud, what makes each read slower and each write faster than a B+-tree.

Kata 3: Measure Index vs Full Scan Performance

Time limit: 30 minutes
Goal: Empirically verify that index lookups beat full scans at high selectivity and lose at low selectivity.
Setup:

  1. In PostgreSQL (or SQLite), create a table t(id INT PK, x INT, payload TEXT) with 10^6 rows. x is uniformly random in [0, 10^6).
  2. Run EXPLAIN ANALYZE SELECT * FROM t WHERE x = ? with one specific ? value. Record planning time, execution time, access path.
  3. Create an index on x. Re-run. Compare.
  4. Now run the range query SELECT * FROM t WHERE x BETWEEN ? AND ? for widening ranges: 0.01%, 0.1%, 1%, 10%, 50% of rows. Record which plan the optimizer picks.

Checklist:

  • You can name the exact selectivity at which the optimizer switches plans.
  • You can explain why without reading documentation.
  • You can identify the cost driver in each regime.

Repeat until: You predict the plan before running EXPLAIN.

Kata 4: Walk Through a Cost Estimate for a Join

Time limit: 20 minutes
Goal: Reproduce on paper what the optimizer does for a two-table join.
Setup: Given:

  • Table A: 10^7 rows, 8 KB pages, 50 rows per page.
  • Table B: 10^5 rows, 100 rows per page.
  • Predicate on A: selectivity 0.01. Predicate on B: selectivity 0.2.
  • Join: A.b_id = B.id, B.id is PK (clustered B+-tree, height 3).
  • Join memory budget: 64 MB. Hash-table overhead: 2x.

Compute, by hand, the estimated page I/Os for:

  1. Hash join with build on filtered B.
  2. Index nested loop with outer = filtered A, inner = B via PK index.
  3. Sort-merge join assuming both sides must be sorted.

Checklist:

  • You wrote each formula and plugged in numbers step by step.
  • You identified the dominant cost term in each plan.
  • You picked a winner and named the assumption most responsible for the choice.
  • You named at least one statistics mistake that would flip the decision.

Repeat until: You can do this mentally in under 3 minutes for a new scenario.

Completion Standard

  • Each kata implemented or walked through at least twice
  • Can explain every answer without rereading the concept page
  • Maintains a small kata log with at least 4 bugs or insights discovered per kata
  • Can describe, in sentences, how each kata connects to a real engine behavior (Postgres, RocksDB, InnoDB, SQLite)