Code Katas

Focused, repeatable coding exercises to build fluency in storage-engine reasoning. Complete each kata multiple times until the setup feels automatic.

Kata 1: Simulate a B+-Tree Insert and Split

Time limit: 25 minutes
Goal: Implement an in-memory B+-tree supporting insert(key, value) and get(key) with a small order (say 4). Print the tree after every insert.
Setup: Model nodes as objects with keys[], children[] (internal) or values[] (leaf), and a linked-list pointer at leaves. Implement leaf splits and the separator promotion. Do not implement deletes.

Checklist:

After 100 random inserts, every leaf holds between ceil(m/2)-1 and m-1 entries (or you are the root).
All leaves sit at the same depth.
A range scan by iterating the leaf chain returns keys in sorted order.
Printed output clearly shows splits and separator promotion.

Repeat until: You can explain, without looking at code, when a split happens and how the separator is chosen.

Kata 2: Implement a Toy LSM Memtable + SSTable

Time limit: 45 minutes
Goal: Build a key-value store with a memtable and disk-resident sorted runs.
Setup:

Memtable: a sorted dict (Python SortedDict, Rust BTreeMap, etc.) capped at N entries.
On overflow, flush to a new SSTable file: write entries in sorted order as (key_len, key, value_len, value) records, plus an in-memory Bloom filter.
get(key): check the memtable first, then SSTables newest-to-oldest, skipping any whose Bloom filter says No.
Implement a minimal leveled-compaction step that merges two equal-sized SSTables into one.

Checklist:

Writes never block on disk I/O.
get returns the most recent value (including the case where an older SSTable has a stale version).
A delete/tombstone returns "not found" even if an older SSTable has the value.
The Bloom filter demonstrably skips at least one SSTable during a negative lookup.

Repeat until: You can explain, out loud, what makes each read slower and each write faster than a B+-tree.

Kata 3: Measure Index vs Full Scan Performance

Time limit: 30 minutes
Goal: Empirically verify that index lookups beat full scans at high selectivity and lose at low selectivity.
Setup:

In PostgreSQL (or SQLite), create a table t(id INT PK, x INT, payload TEXT) with 10^6 rows. x is uniformly random in [0, 10^6).
Run EXPLAIN ANALYZE SELECT * FROM t WHERE x = ? with one specific ? value. Record planning time, execution time, access path.
Create an index on x. Re-run. Compare.
Now run the range query SELECT * FROM t WHERE x BETWEEN ? AND ? for widening ranges: 0.01%, 0.1%, 1%, 10%, 50% of rows. Record which plan the optimizer picks.

Checklist:

You can name the exact selectivity at which the optimizer switches plans.
You can explain why without reading documentation.
You can identify the cost driver in each regime.

Repeat until: You predict the plan before running EXPLAIN.

Kata 4: Walk Through a Cost Estimate for a Join

Time limit: 20 minutes
Goal: Reproduce on paper what the optimizer does for a two-table join.
Setup: Given:

Table A: 10^7 rows, 8 KB pages, 50 rows per page.
Table B: 10^5 rows, 100 rows per page.
Predicate on A: selectivity 0.01. Predicate on B: selectivity 0.2.
Join: A.b_id = B.id, B.id is PK (clustered B+-tree, height 3).
Join memory budget: 64 MB. Hash-table overhead: 2x.

Compute, by hand, the estimated page I/Os for:

Hash join with build on filtered B.
Index nested loop with outer = filtered A, inner = B via PK index.
Sort-merge join assuming both sides must be sorted.

Checklist:

You wrote each formula and plugged in numbers step by step.
You identified the dominant cost term in each plan.
You picked a winner and named the assumption most responsible for the choice.
You named at least one statistics mistake that would flip the decision.

Repeat until: You can do this mentally in under 3 minutes for a new scenario.

Completion Standard

Each kata implemented or walked through at least twice
Can explain every answer without rereading the concept page
Maintains a small kata log with at least 4 bugs or insights discovered per kata
Can describe, in sentences, how each kata connects to a real engine behavior (Postgres, RocksDB, InnoDB, SQLite)

Kata 1: Simulate a B+-Tree Insert and Split​

Kata 2: Implement a Toy LSM Memtable + SSTable​

Kata 3: Measure Index vs Full Scan Performance​

Kata 4: Walk Through a Cost Estimate for a Join​

Completion Standard​

Kata 1: Simulate a B+-Tree Insert and Split

Kata 2: Implement a Toy LSM Memtable + SSTable

Kata 3: Measure Index vs Full Scan Performance

Kata 4: Walk Through a Cost Estimate for a Join

Completion Standard