Build Your Own Database (Relational / SQL)

"SQL is the most successful declarative programming language ever invented. Building a database is mostly building a SQL compiler that talks to a B-tree." -- roughly true

A relational database is the natural successor to the KV store tutorial. You keep the storage engine but add SQL parsing, a query planner, an executor, and (optionally) transactions with ACID guarantees. By the end you have a working SQL database that runs SELECT * FROM users WHERE age > 18.

1. Overview & motivation

A SQL database is a pipeline:

SQL text -> [parser] -> AST -> [analyzer] -> typed AST
        -> [planner] -> logical plan
        -> [optimizer] -> physical plan
        -> [executor] -> rows
                    â†‘
              [storage engine] (your KV store, or a B+ tree)

Each stage is a small but real piece of engineering.

What you can only learn by building one:

Why B+ trees on disk are the ubiquitous storage structure for relational databases.
Why query optimization has been an open research area for 50+ years.
Why transactions with isolation are unexpectedly subtle (read-committed, repeatable-read, serializable).
Why MVCC (multi-version concurrency control) is in nearly every production database.
Why most "scaling problems" with SQL are actually "we wrote bad SQL" -- and how to teach your database to compile slightly less bad SQL.

2. Where this fits in the degree

Phase: Architecture
Semester: 6 (Databases and Distributed Systems)
Modules deepened: Module 1 (relational databases & SQL), Module 2 (storage), Module 4 (transactions). This project spans most of Semester 6.

Cross-phase relevance:

Builds on the Database (KV) tutorial.
Builds on the Compiler tutorial -- SQL -> physical plan is compilation.

3. Prerequisites

Complete the Database (KV) tutorial first, or use SQLite's source as your storage substrate.
SQL fluency -- basic SELECT, JOIN, WHERE, GROUP BY.
A systems language: C, Go, Rust.
Parsing experience -- the Interpreter tutorial is excellent prep.

4. Theory & research

Required reading

Connor Stack, "Let's Build a Simple Database" (cstack.github.io/db_tutorial/) -- building a sqlite clone in C. 13 parts. Covers REPL, virtual machine, B-tree, cursors, indexes. â recommended primary.
James Smith, "Build Your Own Database from Scratch: From B+Tree to SQL in 3000 Lines" (build-your-own.org/database) -- Go. End-to-end. â alternative primary.

Strongly recommended

Alex Petrov, Database Internals -- entire book. Storage engines and distributed databases.
Hellerstein, Stonebraker, Hamilton, "Architecture of a Database System" (Foundations and Trends in Databases, 2007) -- 122-page survey paper by three database luminaries. The single best paper on what a database is. Free PDF.
Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book -- Stanford's textbook. Comprehensive.
Date, An Introduction to Database Systems -- older but unmatched on relational theory.

For specific topics

CMU 15-445/645 Database Systems (15445.courses.cs.cmu.edu) -- Andy Pavlo's course. Free recordings on YouTube. Goes deeper than most.
CMU 15-721 Advanced Database Systems -- modern columnar / in-memory.
SQLite source code -- the most-read C codebase. pager.c, btree.c, vdbe.c. Extraordinarily clean.
Postgres source code -- much larger. src/backend/optimizer/ is the famous query planner.

For transactions and concurrency

Bernstein, Hadzilacos, Goodman, Concurrency Control and Recovery in Database Systems (1987) -- free PDF. Old, comprehensive.
Berenson et al., "A Critique of ANSI SQL Isolation Levels" (SIGMOD 1995) -- the paper that named the anomalies you read about.

5. Curated tutorial list (from BYO-X)

C: Let's Build a Simple Database -- Connor Stack, cstack.github.io â recommended primary
C#: Build Your Own Database
Go: Build Your Own Database from Scratch: From B+Tree To SQL in 3000 Lines -- build-your-own.org â alternative primary
Go: Code a database in 45 steps: a series of test-driven small coding puzzles
Python: DBDB: Dog Bed Database -- from 500 Lines or Less. Excellent short chapter on a Python KV with B-tree.

(KV-focused entries are in the Database (KV) tutorial.)

6. Recommended primary path

Connor Stack, "Let's Build a Simple Database" (C, sqlite clone).

13 incremental parts. Each part adds one feature. The final code is ~1,000 lines of C and runs:

sqlite> insert 1 cstack foo@bar.com
sqlite> select
(1, cstack, foo@bar.com)

Parts cover: REPL, single-row insert, B-tree implementation, cursors, multi-row inserts with internal nodes, range scans.

For Go: James Smith's "Build Your Own Database" is the equivalent. Goes further (adds a real SQL parser).

If you've done the KV store and want to add the SQL layer on top, James Smith's path is the right next step.

7. Implementation milestones

Milestone 1: SQL REPL with hardcoded schema

INSERT and SELECT for one table with one fixed schema. Tokenize the input. Match against the two commands.

typedef struct { int id; char name[32]; char email[256]; } Row;

void execute_insert(Row *row) { /* append to in-memory table */ }
void execute_select() { /* print all rows */ }

Evidence: REPL accepts insert 1 alice alice@example.com and select, prints rows.

Milestone 2: Persistence to a single file

Replace the in-memory table with a file. Page-based: 4 KB pages, each containing N fixed-size rows.

A Pager abstraction wraps file I/O: get_page(n) returns a pointer to a 4 KB buffer, faulting in from disk if needed.

Evidence: Insert rows, close DB, reopen, select -- rows persist.

Milestone 3: B+ tree for indexed access

Replace the row-array with a B+ tree on disk. Each page is a B+ tree node (leaf or internal). Pages are 4 KB.

typedef enum { NODE_INTERNAL, NODE_LEAF } NodeType;
// Leaf node layout:
// [type|is_root|parent_page] [num_cells]
// [cell_0_key, cell_0_value]
// [cell_1_key, cell_1_value]
// ...

This is the conceptual high point. Read the cstack tutorial parts 8-11 carefully.

Evidence: Insert 1,000 random keys in random order; range scan returns sorted. Insert 100,000 keys; tree stays balanced.

Milestone 4: Splits and internal nodes

When a leaf is full, split into two. Promote the median key to the parent. Repeat if the parent splits.

Evidence: Insert 100,000 keys with deletions interleaved. Tree depth is O(log N).

Milestone 5: SQL parser

Tokenize, parse with recursive descent (or Pratt, if you've done the Interpreter tutorial). Support: CREATE TABLE, INSERT, SELECT, WHERE, simple expressions.

CREATE TABLE users (id INTEGER, name TEXT, age INTEGER);
INSERT INTO users VALUES (1, 'alice', 30);
SELECT * FROM users WHERE age > 25;

Evidence: Parse the above queries into ASTs. Pretty-print ASTs.

Milestone 6: Query planner & executor

Convert AST to a tree of operators (Scan, Filter, Project, Join). Execute by pulling rows through the tree (volcano model).

SELECT name FROM users WHERE age > 25

becomes:

    Project([name])
       â”‚
    Filter(age > 25)
       â”‚
    SeqScan(users)

Evidence: Run the SELECT above. Returns the expected rows.

Milestone 7: Index-aware planning

If the filter is on an indexed column (the primary key, in our B+ tree), use an index scan instead of a sequential scan.

SELECT * FROM users WHERE id = 42

becomes:

    IndexScan(users.id = 42)   â† O(log n)

Evidence: With 1M rows, indexed lookup is O(log n) (~microseconds); sequential scan is O(n) (~seconds). Measurable difference.

Milestone 8: Joins

Nested-loop join first (simplest, often optimal for small tables). Hash join for large equi-joins.

SELECT users.name, orders.amount 
FROM users 
JOIN orders ON users.id = orders.user_id;

Evidence: Correct results for various join scenarios.

Milestone 9: Aggregation

GROUP BY, COUNT, SUM, AVG. A hash aggregator: insert each row into a hash map keyed by GROUP BY columns.

Milestone 10 (optional, hard): Transactions

Add WAL (write-ahead log). On commit, fsync the log. On crash, replay WAL to recover.

Add MVCC: each row has a "version" -- a transaction ID. Readers see the snapshot of their transaction's start time. Writers create new versions; old versions are garbage-collected later.

This is where database engineering becomes real engineering. Plan for weeks.

8. Tests & evidence

Test	How
SQL parse	A test suite of valid and invalid SQL strings.
Insert/select correctness	Random data inserted, queries return expected rows.
Persistence	Insert, restart, query.
B-tree invariants	After N random inserts/deletes, tree balanced, all keys reachable.
Crash recovery	Kill mid-write, restart, check integrity.
Performance	Insert 1M rows; measure throughput. Index scan vs seq scan.
SQL interop	A real SQLite test query runs and returns the same rows on your DB.

The strongest evidence: ./mydb -f test.db runs an interactive SQL session where a reviewer can type queries and get correct answers.

9. Common pitfalls

Mixing B-tree concepts with B+ tree concepts. Use B+ tree (data only in leaves). It's simpler and supports range scans naturally.
Variable-length keys. Fixed-size first (e.g., int keys). Variable-length later.
Forgetting to mark pages dirty. A read from a page is free; a write must flush the page when evicted from the buffer pool.
Buffer pool eviction during mutation. If you evict a page that's mid-modification, corruption. Pin pages while writing.
Off-by-one in B-tree split point. Choose the right median. Test with hand-traced examples.
Naive SQL parsing. Operator precedence is real. Use a Pratt parser.
WAL ordering. The log entry must hit disk before the data page. Otherwise, on crash, your data is corrupted but the WAL says nothing.
MVCC garbage collection. Old versions accumulate. Need a background GC.

10. Extensions

Hash indexes. O(1) point lookup but no range scans. Useful for some workloads.
Cost-based optimizer. Maintain table statistics; estimate query costs; choose the cheapest plan.
Subqueries -- correlated and uncorrelated.
Window functions -- OVER (PARTITION BY ...). Hard.
Foreign keys with cascading deletes. Referential integrity.
Constraint checking. NOT NULL, UNIQUE, CHECK.
Online schema changes -- add/drop columns without downtime.
Replication. Async or sync. Step toward distributed DB.

11. Module integration

Module	What the relational DB deepens
Sem 6 Module 1 -- Relational & SQL	The whole project.
Sem 6 Module 2 -- Storage & indexing	B+ tree on disk. Buffer pool. Compaction (if you implement it).
Sem 6 Module 4 -- Transactions & consistency	If you implement WAL + MVCC.
KV store tutorial	The natural prerequisite.
Compiler tutorial	SQL -> plan is compilation. AST -> optimization -> execution is the same shape.

12. Portfolio framing

What to publish:

Source organized as parser/, planner/, executor/, storage/, wal/ (if implemented).
A SQL test suite (tests/sql/) with .sql files and expected outputs.
A README with:
- Supported SQL grammar (BNF or examples).
- Storage architecture diagram.
- Performance numbers (rows/sec for insert, query latency at various sizes).
- Honest list of limitations.

Reviewer entry points:

parser/parser.go -- the SQL parser.
planner/planner.go -- AST -> plan.
executor/scan.go -- operator implementations.
storage/btree.go -- the on-disk index.
README must include: example session transcript, supported SQL grammar, known limitations.

This is a flagship portfolio project. "I built a working SQL database with a B+ tree storage engine and query planner" is a portfolio anchor.

13. Local source backbone

Use Build Your Own Database From Scratch (build-your-own/database-james-smith) and the Go 2E chunk set (build-your-own/database-go-2e-james-smith) as the spine for the relational path once the KV engine is real.

Local chunks	Use them for	Add to this project
`001`-`002`	File persistence, atomic replacement, durable write patterns, and LSM context	Add an opening storage-design decision: append log, B+ tree, or hybrid.
`003`-`008`	B-tree practice, node encoding, deletion, page allocation, free list	Require a storage-engine milestone before SQL parsing.
`009`-`011`	Point query, range query, secondary indexes, index maintenance	Add table-scan versus index-scan evidence.
`012`-`014`	Transaction interfaces, concurrent transactions, implementation analysis, parser entry	Add a transaction API sketch before implementing SQL statements.
`015`-`016`	Grammar, expression parsing, statement parsing, executing statements	Add BNF, AST snapshots, and execution-plan snapshots to the SQL milestone.
`017`	Further steps	Use as the extension queue: optimizer, richer SQL, concurrency, and durability improvements.

Extra checkpoints from the book chunks

Storage-first checkpoint: implement insert/get/range on the table file before accepting SQL.
Grammar checkpoint: document supported SQL in BNF and keep invalid-query tests beside valid-query tests.
Plan checkpoint: print a logical plan and physical plan for at least three queries.
Isolation checkpoint: run two concurrent transaction scenarios and explain the anomaly allowed or prevented.

14. Deep project spec

Project contract

Build a small relational database with a documented SQL subset, table storage, indexes, logical plans, physical execution, and a clear transaction story. The first version may be single-process and single-writer, but it must say so explicitly.

Source-backed reading map

Source ID	Use for	Required output
`build-your-own/database-james-smith`	storage, B+ tree, transactions, SQL parser bridge	table file, index tests, transaction API sketch
`build-your-own/database-go-2e-james-smith`	implementation staging and query execution	SQL grammar, AST snapshots, plan snapshots

Milestone map

Milestone	Deliverable	Tests	Failure case
SQL subset	BNF and parser	valid/invalid SQL AST snapshots	syntax error with location
Table storage	row format and table file	insert/get fixtures	malformed row rejected
Index	primary-key B+ tree or equivalent	point/range lookup tests	duplicate key
Executor	scan, filter, projection	query output golden tests	unknown column
Planner	logical and physical plan dumps	plan snapshots	unsupported query has clear error
Transactions	begin/commit/rollback or explicit non-support	rollback/restart tests	partial transaction after crash
Concurrency	lock or single-writer contract	two-client scenario	anomaly documented

Test matrix

Test type	Required examples
Golden	SQL -> AST -> logical plan -> physical plan
Storage	row encoding, page split, index lookup
Differential	compare simple SELECT results to SQLite where subset overlaps
Negative	invalid grammar, unknown table/column, type mismatch
Transaction	commit, rollback, crash/restart if supported

Design notes required

sql-subset.md: supported statements, expressions, and omissions.
storage.md: row/page/index layout and constraints.
planner.md: logical operators, physical operators, and cost assumptions.
transactions.md: isolation level or explicit non-goal.

Portfolio evidence

Publish an example SQL session, AST/plan dumps for three queries, storage/index diagrams, transaction caveats, and a correctness comparison against SQLite for the supported subset.

Source

This tutorial draws from the BYO-X catalog "Database" section. Connor Stack's tutorial and James Smith's "Build Your Own Database" are the canonical primary references. CMU 15-445 is the canonical academic resource.

1. Overview & motivation​

2. Where this fits in the degree​

3. Prerequisites​

4. Theory & research​

Required reading​

Strongly recommended​

For specific topics​

For transactions and concurrency​

5. Curated tutorial list (from BYO-X)​

6. Recommended primary path​

7. Implementation milestones​

Milestone 1: SQL REPL with hardcoded schema​

Milestone 2: Persistence to a single file​

Milestone 3: B+ tree for indexed access​

Milestone 4: Splits and internal nodes​

Milestone 5: SQL parser​

Milestone 6: Query planner & executor​

Milestone 7: Index-aware planning​

Milestone 8: Joins​

Milestone 9: Aggregation​

Milestone 10 (optional, hard): Transactions​

8. Tests & evidence​

9. Common pitfalls​

10. Extensions​

11. Module integration​

12. Portfolio framing​

13. Local source backbone​

Extra checkpoints from the book chunks​

14. Deep project spec​

Project contract​

Source-backed reading map​

Milestone map​

Test matrix​

Design notes required​

Portfolio evidence​

Source​