Build Your Own Database (Key-Value / Redis-like)
"A database is a careful arrangement of files." -- Edmund Burke (paraphrased)
A key-value store is the perfect first database project. The data model is trivial (put(k, v), get(k), delete(k)), but the implementation decisions -- log-structured vs B-tree, write-ahead logging, compaction, indexing -- are the actual database engineering. Build one and you'll never read a Postgres or Redis blog post the same way again.
1. Overview & motivation
Two dominant storage engine designs:
- Log-Structured Merge tree (LSM) -- append-only writes, periodic compaction. Used by LevelDB, RocksDB, Cassandra, Scylla.
- B+ tree -- in-place updates, balanced tree pages. Used by SQLite, Postgres, MySQL (InnoDB), BoltDB.
A simpler variant in tutorial form is the Bitcask model: append-only log, in-memory hash index. Used by Riak. Most readable starting point.
What you can only learn by building one:
- Why writes are easy (append) but reads through an LSM are surprisingly complex (search active memtable + N SSTables in order).
- Why compaction is the heart of any log-structured engine -- and the source of its operational complexity.
- Why B+ trees make range scans free but random writes expensive.
- Why write-ahead logging (WAL) is the answer to "what if we crash mid-write."
- Why caching (LRU, ARC) appears in every database.
2. Where this fits in the degree
- Phase: Architecture
- Semester: 6 (Databases and Distributed Systems)
- Modules deepened: Module 2 (storage engines & indexing) -- the project for this module. Module 4 (transactions & consistency) if you add WAL + atomicity.
Cross-phase relevance:
- Builds on the Memory Allocator tutorial -- allocators matter for buffer pools.
- Sets up the Database (relational) tutorial -- same storage, more on top.
- Foundation for the Search Engine tutorial -- inverted indexes use the same on-disk structures.
3. Prerequisites
- A systems language: Go, Rust, C++, or C.
- File I/O comfort.
- Hash tables, sorted maps, B-trees as algorithms (you don't need to have implemented them).
- Tolerance for thinking about crash scenarios: "what if we lose power right now?"
4. Theory & research
Required reading
- Martin Kleppmann, Designing Data-Intensive Applications, Chapter 3 -- "Storage and Retrieval." The single best 50 pages on storage engines. â start here.
- Pavel Emelyanov & Edward Z. Yang, "Let's Build a Simple Database" (cstack.github.io/db_tutorial/) -- sqlite clone in C, walks through file format, B-tree, cursors.
Strongly recommended
- Alex Petrov, Database Internals -- Part 1 (Storage Engines). Newer than DDIA, more storage-focused.
- James Smith, "Build Your Own Database from Scratch: From B+Tree to SQL in 3000 Lines" -- build-your-own.org/database/. Go. Complete B+ tree + SQL layer.
For Redis specifically
- Redis source code -- surprisingly readable. Start at
src/dict.c(hash table),src/t_string.c(string commands),src/server.c(event loop). - antirez (Salvatore Sanfilippo), "Redis design and implementation" -- multiple blog posts. The architect's notes.
For LSM trees
- The original LSM paper: O'Neil, Cheng, Gawlick, O'Neil (1996), "The Log-Structured Merge-Tree" -- short. Read once for context.
- LevelDB source code -- about 30K lines of C++. Mature implementation of an LSM with one-thread compaction.
- Mark Callaghan's blog -- comparative LSM engineering writing.
5. Curated tutorial list (from BYO-X)
The BYO-X catalog "Database" entries cover both KV and relational. Here are the KV-flavored ones:
- C++: Build Your Own Redis from Scratch -- build-your-own.org/redis -- comprehensive book/site, Go â recommended primary for Redis-like
- C#: Build Your Own Database
- Clojure: An Archaeology-Inspired Database
- Crystal: Why you should build your own NoSQL Database
- Go: Build Your Own Database from Scratch: From B+Tree To SQL in 3000 Lines -- build-your-own.org/database â recommended primary for B+ tree path
- Go: Code a database in 45 steps: a series of test-driven small coding puzzles
- Go: Build Your Own Redis from Scratch
- JavaScript: Dagoba: an in-memory graph database -- different model (graph)
- Python: DBDB: Dog Bed Database -- from the 500 Lines or Less book â classic
- Python: Write your own miniature Redis with Python
- Ruby: Build your own fast, persistent KV store in Ruby
- Rust: Build your own Redis client and server
6. Recommended primary path
Three excellent paths; pick by interest:
Path A: Redis-like (network-server-shaped)
James Smith, "Build Your Own Redis" (build-your-own.org/redis). Free book. Builds a Redis-compatible server: TCP, event loop, command parsing, hash tables, sorted sets, AOF persistence, replication. ~2,000 lines of Go.
Choose this if you also want to learn server architecture (event loops, non-blocking I/O).
Path B: B+ tree storage engine
James Smith, "Build Your Own Database from Scratch" (build-your-own.org/database). Same author, focused on disk-resident B+ tree + SQL layer. 3,000 lines of Go.
Choose this if you want the storage engine internals.
Path C: Bitcask-style (LSM-precursor)
Read the Bitcask paper (Riak) -- 6 pages. Then implement it. Append-only log + in-memory hash index + compaction.
Smallest scope. Easy to finish in a week. Excellent for understanding why production engines became more complex.
For this degree: Path C -> Path B if you want depth; Path A if you want the Redis shape.
7. Implementation milestones (Bitcask-style, Path C)
Milestone 1: Append-only log with in-memory index
Writes append a record to a log file. The in-memory index maps key -> (file_offset, length). Reads seek + read.
class DB:
def __init__(self, path):
self.path = path
self.index = {} # key -> (offset, length)
self.fd = open(path, 'a+b')
self._rebuild_index()
def put(self, key, value):
offset = self.fd.tell()
record = encode(key, value)
self.fd.write(record)
self.index[key] = (offset, len(record))
def get(self, key):
if key not in self.index: return None
offset, length = self.index[key]
self.fd.seek(offset)
record = self.fd.read(length)
return decode(record).value
def delete(self, key):
if key in self.index:
del self.index[key]
self.put(key, TOMBSTONE)
Evidence: 100k puts and gets work. After restart, get still returns the right values (index rebuilt on startup).
Milestone 2: Crash safety with checksums
Each record has a CRC32. On reading, verify the CRC. If a partial record is at the tail (we crashed mid-write), truncate it.
record = checksum + key_size + value_size + key + value
Evidence: Kill the process during a write, restart, verify integrity. No corruption.
Milestone 3: Multiple log files + compaction
When the active log exceeds a threshold (say 1 MB), close it, open a new one. Run a background compaction: for each closed log, rewrite a smaller file containing only the latest version of each key (drop tombstoned/overwritten keys).
log-001 (closed, 1 MB) ─â”
log-002 (closed, 1 MB) ─┼─-> compact -> log-merged (200 KB)
log-003 (closed, 1 MB) ─┘
log-004 (active, 256 KB)
Evidence: After heavy update workload, on-disk size shrinks after compaction.
Milestone 4: Persistent index hint file
Rebuilding the index from logs on startup is O(records). For large DBs, this is slow. Periodically write a "hint" file: a snapshot of the in-memory index. Startup: load hint + replay any newer log entries.
Evidence: Startup time stays constant even as DB grows.
Milestone 5: Concurrent access
Single writer (mutex), many readers. Background compactor as a separate goroutine/thread.
Evidence: Multi-reader stress test: 8 reader threads, 1 writer thread, no data races (run under TSan).
Milestone 6 (optional): LSM tree progression
Once Bitcask is solid, the leap to LSM is:
- MemTable (in-memory sorted map for recent writes) replaces the simple hash.
- When MemTable is full, flush as an immutable SSTable on disk.
- SSTables are sorted, supporting range scans.
- Compaction merges N SSTables into 1 larger one.
- Reads check MemTable, then each SSTable from newest to oldest.
LevelDB and RocksDB do exactly this with multi-level storage.
Evidence: Range scan for k in db.range("user:", "user:~"): works correctly across multiple SSTables.
Milestone 7 (optional): Network server (Redis path)
Add TCP server with RESP (Redis serialization protocol) parsing. Now redis-cli connects to your server.
RESP commands:
*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n ↠SET foo bar
+OK\r\n ↠response
Evidence: redis-cli -p YOUR_PORT SET foo bar; GET foo returns bar.
8. Tests & evidence
| Test | How |
|---|---|
| Basic put/get/delete | Trivially. Edge cases: empty value, large value, deleted key. |
| Persistence | Write, close DB, reopen, read. |
| Crash recovery | Kill mid-write; restart; verify no corruption. |
| Compaction correctness | Pre-compact and post-compact reads return identical values. |
| Range scan (LSM) | Multi-SSTable range scan returns in sorted order. |
| Performance | Random write throughput; random read throughput; range scan throughput. Plot vs DB size. |
| Stress | 1M writes, then 10M reads at random keys, with crashes injected. |
The strongest evidence: a benchmark plot comparing your engine to LevelDB or BoltDB on the same workload. You'll be slower; that's fine. Document by how much and why.
9. Common pitfalls
fsyncconfusion.write()returns when the OS buffer has the data, not when it's on disk. Withoutfsyncon commit, a power loss loses recent writes. Decide: durability per commit (slow but safe) or batched (fast but lossy).- Index integrity. If your in-memory index gets out of sync with the log, reads return wrong values. Always update the log first, then the index.
- Tombstones in compaction. A tombstone means "this key was deleted." After compaction, tombstones must be retained until all older versions have also been compacted away.
- Reading from a being-written file. Two file handles (one for writing, one for reading) is cleanest.
- No CRC. Without checksums, a corrupted byte silently returns wrong data forever.
- Single-file design with many keys. Performance plateaus once the file grows huge. Real engines split into many files.
- Treating compaction as optional. Without it, on-disk size grows without bound.
- Locking the world during compaction. Compaction must run concurrently with reads and writes. This is the operational tension that defines every LSM engine.
10. Extensions
- Snapshots. Point-in-time consistent reads, useful for backups.
- TTL (expiration). Records with an expiration timestamp.
- Compression. Snappy, LZ4, or zstd on SSTables.
- Bloom filters. For each SSTable, precompute a bloom filter so most "key not present" reads skip the file.
- Range scans + bounded iterators. Standard for any LSM.
- Transactions. Atomic batches of writes (write group commit).
- Replication. A simple primary-replica setup. Leads naturally to the Kafka-like tutorial.
11. Module integration
| Module | What the KV store deepens |
|---|---|
| Sem 6 Module 2 -- Storage engines & indexing | The whole project. |
| Sem 6 Module 4 -- Transactions & consistency | Adding WAL + atomic batches turns this into a transactional store. |
| Memory Allocator tutorial | Buffer pools and slab allocation matter for performance. |
| Search Engine tutorial | Inverted indexes use SSTable-like sorted on-disk structures. |
| Database (relational) tutorial | Same storage engine; add SQL parsing and a query planner on top. |
| Sem 6 Module 3 -- Replication | After single-node works, add a follower. Step toward Kafka-like. |
12. Portfolio framing
What to publish:
- Source organized as
engine/,wal/,compaction/,server/(if Redis path). - A benchmark report comparing your engine to LevelDB/BoltDB/Redis on the same workload.
- A README with:
- Design decisions (LSM vs B-tree vs Bitcask).
- The crash-recovery test transcript.
- Limitations (no replication, single writer, etc.).
- A
docs/architecture.mdwith on-disk format diagrams.
Reviewer entry points:
engine/put.go-- the write path.engine/get.go-- the read path (more complex than you'd think after compaction).compaction/merge.go-- the operationally-critical code.bench/results.md-- the comparison.
A working storage engine is one of the strongest portfolio pieces in a systems/architecture engineer's collection. "I implemented an LSM-like KV store that handles 100k writes/sec on this benchmark" is concrete and reads well.
13. Local source backbone
Use these local chunks to turn the KV/Redis path into a full storage-and-server progression:
- Build Your Own Database From Scratch (
build-your-own/database-james-smith) - Build Your Own Database From Scratch in Go, 2E (
build-your-own/database-go-2e-james-smith) - Build Your Own Redis with C/C++ (
build-your-own/redis-james-smith)
| Local chunks | Use them for | Add to this project |
|---|---|---|
Database 001-002 | Files versus databases, atomic rename, durable files, LSM overview | Add a crash-safety lab before any tree work. |
Database 003-005 | B-tree/B+ tree intuition, node encoding, insertion, deletion | Add page-layout diagrams and invariant tests. |
Database 006-008 | Persisting pages, master page, allocation, free list | Add a page allocator and free-list inspection tool. |
Database 009-011 | Point queries, range queries, secondary indexes | Add range-scan and secondary-index milestones. |
Database 012-014 | Transactions, concurrent transactions, SQL parser bridge | Treat as the handoff into the relational database tutorial. |
Redis 001-010 | TCP byte stream, sockets, non-blocking I/O, poll/event loop, batching | Add a Redis-compatible network front end after the storage engine works. |
Redis 011-020 | Command processing, response serialization, hash tables, resizing, binary trees, sorted sets | Add RESP command tests and in-memory data-structure benchmarks. |
Redis 021-027 | Non-blocking I/O, timers, TTL commands, thread pool/semaphore work | Add TTL expiry, background jobs, and event-loop latency measurements. |
Extra checkpoints from the book chunks
- Durability checkpoint: kill the process during append, page write, and compaction; document recovery behavior.
- Page checkpoint: print one on-disk page as decoded fields, not just bytes.
- Protocol checkpoint: support a tiny RESP subset and verify it with a real Redis client where practical.
- Latency checkpoint: measure p50/p95 command latency before and after TTL/background work.
14. Deep project spec
Project contract
Build a key-value storage engine with a documented persistence model. The minimum contract is put, get, delete, range scan if using ordered storage, crash recovery, and a benchmarkable workload. The project must state whether it is append-log, LSM-like, B+ tree/page-based, or in-memory with persistence.
Source-backed reading map
| Source ID | Use for | Required output |
|---|---|---|
build-your-own/database-james-smith | page layout, B+ tree, persistence, free list, indexes | storage format, decoded page dump, recovery tests |
build-your-own/database-go-2e-james-smith | Go implementation path and SQL bridge | implementation checkpoints and range/index tests |
build-your-own/redis-james-smith | in-memory comparison and server extension | optional network front end and latency report |
Milestone map
| Milestone | Deliverable | Tests | Failure case |
|---|---|---|---|
| Storage contract | data model and file format | spec checklist | unsupported key/value size |
| Write path | durable put/delete | append/page write fixtures | torn or partial write |
| Read path | get and tombstone handling | model comparison | deleted key reappears |
| Index structure | B+ tree, hash index, or LSM memtable/SSTable | invariant tests | split/merge corruption |
| Range scan | ordered iteration if supported | sorted output fixtures | scan crosses page/table boundary |
| Recovery | restart and repair behavior | kill/restart tests | corrupted tail/page detected |
| Compaction/free list | reclaim space | before/after size and correctness | live key lost during compaction |
Test matrix
| Test type | Required examples |
|---|---|
| Unit | encoding, page/node invariants, comparator behavior |
| Model | randomized operations compared to an in-memory map |
| Crash | kill during write, rename, compaction, or page flush |
| Golden | decoded on-disk page/log fixture |
| Benchmark | read-heavy, write-heavy, mixed workload with p50/p95 |
Design notes required
format.md: file/page/log layout and versioning.recovery.md: what is durable, what may be lost, and how corruption is handled.index-choice.md: why this structure fits the target workload.benchmark.md: workload, data size, machine, and interpretation of results.
Portfolio evidence
Publish the storage contract, decoded page/log examples, crash-recovery transcript, randomized model-test summary, and one benchmark table with limitations.
Source
This tutorial draws from the BYO-X catalog "Database" section. DDIA Chapter 3 and James Smith's "Build Your Own Redis" / "Build Your Own Database" are the canonical primary sources.