Module 2: Storage Engines & Indexing
Primary text: Database Internals (Petrov)
Selective support: Designing Data-Intensive Applications Chapter 3 for architectural framing, Database System Concepts for classical rigor on indexes, query processing, concurrency, and recovery; Distributed Systems: Concepts and Design only when storage crosses machine boundaries
This guide is the primary teacher. You do not need to read the source books front-to-back to complete this module. You do need to become operationally strong at disk-aware storage reasoning, index structure choice, query execution cost, and the concurrency primitives that sit inside a single node before Module 3 takes the database across machines.
Scope of This Module
This module is not a catalog of storage engine brand names. It is where database performance stops being a magic box and becomes something you can reason about in pages, seeks, and access paths.
What it covers in depth:
- why databases treat disk, SSD, and memory hierarchies differently, and how random vs sequential I/O dominates design
- how pages, records, heap files, and slotted pages arrange bytes so the buffer pool can move them around
- what the buffer pool actually does and why page replacement (LRU, clock, LRU-K) is not an afterthought
- B+-tree structure, invariants, splits, merges, and why it wins for mixed OLTP workloads
- point lookups, range scans, and the arithmetic of tree height vs fan-out
- clustered vs secondary indexes, covering indexes, and index-only scans
- LSM-trees: memtables, SSTables, compaction strategies, and read/write/space amplification tradeoffs
- Bloom filters as LSM read accelerators and where they buy you speed vs false-positive cost
- hash, bitmap, GIN, and GiST as specialty index structures and when they are justified
- query execution models: volcano/iterator, vectorized, and JIT-compiled pipelines
- join algorithms: nested loop, hash, sort-merge, and how the optimizer chooses between them
- statistics, histograms, selectivity estimation, and what a cost-based optimizer actually does
- lock-based concurrency control: 2PL, lock modes, escalation, deadlock
- MVCC and snapshot isolation as the concurrency story modern engines actually tell
- write-ahead logging (WAL) and the ARIES recovery model, including redo, undo, and checkpoints
What it deliberately does not try to finish here:
- replication, partitioning, and cross-node placement (Module 3)
- the full ACID/isolation spectrum and consistency theory (Module 4)
- distributed transactions, consensus, and global time (Module 5)
- hand-rolling a production storage engine
This is the implementation-depth foundation for the rest of Semester 6.
Before You Start
Answer these closed-book before starting the main path:
- Why does a database care whether an I/O is random or sequential if both return the same bytes?
- If a B+-tree has fan-out
250and holds10^9keys, roughly how tall is it, and why does that matter for latency? - What is the difference between an LSM-tree and a B-tree in terms of where new writes go?
- Why does a secondary index usually cost you an extra I/O compared to a clustered index?
- What does "the database runs out of memory" actually mean if the buffer pool is full?
Diagnostic Interpretation
4-5 solid answers
- You are ready for the full path.
2-3 solid answers
- Continue, but expect extra time in the B-tree cluster and the LSM cluster.
0-1 solid answers
- Revisit Module 1 (tables, rows, queries) and Semester 5 memory-hierarchy material before starting. Storage engine reasoning assumes both.
What This Module Is For
Storage engines are the layer where database claims meet physics. Later work repeatedly asks questions like:
- will this index survive the write workload, or will it collapse into random I/O?
- why is the 50th percentile latency
2 msbut the 99th is300 ms? - can I afford a new secondary index on this column, or will it break inserts?
- why does my transaction wait when nobody else seems to be reading or writing?
- what happens on crash, and how much data do I lose?
This module builds the implementation reasoning needed for:
- replication logs and partitioned placement (Module 3)
- isolation levels and concurrency under real workloads (Module 4)
- distributed storage under failure (Module 5)
- performance tuning, capacity planning, and query-plan debugging in every system design course thereafter
You are learning to reason about persistence and indexing without handwaving.
Concept Map
How To Use This Module
Work in order. The later clusters only make sense if the earlier storage habits are stable.
Cluster 1: On-Disk Storage Foundations
| Order | Concept | Type | Focus |
|---|---|---|---|
| 1 | Why Databases Care About Disk: Random vs Sequential I/O | PRIMARY | Latency hierarchy, block devices, and why access pattern dominates cost |
| 2 | Pages, Records, Heap Files, Slotted Pages | PRIMARY | The unit of transfer and how variable-length records are packed inside it |
| 3 | The Buffer Pool and Page Replacement | PRIMARY | Frames, pinning, dirty flags, and LRU/clock/LRU-K replacement |
Cluster mastery check: Can you explain what physically happens on disk and in memory when a single SELECT returns one row from a cold database?
Cluster 2: B-Tree Indexes
| Order | Concept | Type | Focus |
|---|---|---|---|
| 4 | B+-Tree Structure and Invariants | PRIMARY | Nodes, fan-out, balance invariants, splits, and merges |
| 5 | Point Lookups, Range Scans, and Why B+-Trees Dominate OLTP | PRIMARY | Height arithmetic, leaf-level linked lists, and mixed-read workloads |
| 6 | Clustered vs Secondary Indexes; Covering Indexes | PRIMARY | Row placement, index-only scans, and the extra-fetch problem |
Cluster mastery check: Given a table, a query, and an index definition, can you trace which pages are read and justify the plan?
Cluster 3: LSM and Alternative Index Structures
| Order | Concept | Type | Focus |
|---|---|---|---|
| 7 | LSM-Trees: SSTables, Memtables, Compaction | PRIMARY | Write-optimized design, leveled vs tiered compaction, amplification |
| 8 | Bloom Filters as LSM Read Accelerators | SUPPORTING | Probabilistic membership tests, false positives, and where they pay off |
| 9 | Hash, Bitmap, GIN/GiST: Specialty Indexes | SUPPORTING | When a B+-tree is the wrong shape for the workload |
Cluster mastery check: Given a workload description (write-heavy KV, analytical scan, full-text search, geospatial, low-cardinality filter), can you pick an index structure and defend it?
Cluster 4: Query Execution and Planning
| Order | Concept | Type | Focus |
|---|---|---|---|
| 10 | Query Execution Models: Volcano / Iterator, Vectorized, JIT | PRIMARY | Pulling tuples, batching for cache, and compiling to native code |
| 11 | Join Algorithms: Nested Loop, Hash, Sort-Merge | PRIMARY | Algorithm shape, memory budget, and when each wins |
| 12 | Statistics, Histograms, and the Cost-Based Optimizer | PRIMARY | Selectivity estimates, join order, and why bad stats kill performance |
Cluster mastery check: Given an EXPLAIN plan for a three-way join, can you name the operators, name the access paths, and estimate where the work happens?
Cluster 5: Concurrency Control in Storage
| Order | Concept | Type | Focus |
|---|---|---|---|
| 13 | Locking-Based Concurrency: 2PL, Lock Escalation | PRIMARY | Shared/exclusive modes, growing and shrinking phases, deadlock |
| 14 | MVCC: How Snapshots Avoid Reader-Writer Contention | PRIMARY | Versions, visibility rules, and snapshot isolation |
| 15 | Write-Ahead Logging (WAL) and Recovery with ARIES Sketch | PRIMARY | Log-before-data, redo, undo, checkpoints, and crash recovery |
Cluster mastery check: Can you walk through what happens during and after a crash mid-transaction under a WAL+MVCC engine, and say which data survives?
Then work these practice pages:
| Order | Practice path | Focus |
|---|---|---|
| 1 | Storage Engine Foundations Lab | I/O reasoning, page layout, buffer-pool thinking |
| 2 | Index Selection and Access Path Workshop | B+-tree vs LSM, clustered vs secondary, specialty indexes |
| 3 | Query Planning and Concurrency Clinic | Joins, cost estimation, locking, MVCC, recovery scenarios |
| 4 | Code Katas | B+-tree splits, toy LSM, index vs scan benchmarks, join cost walkthrough |
Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.
Learning Objectives
By the end of this module you should be able to:
- Explain why random vs sequential I/O shapes engine design, and estimate the cost of a query in page reads.
- Describe page-level storage (heap files, slotted pages, record headers) and show where a new row physically lands.
- Explain what the buffer pool does, including pin counts, dirty pages, and at least two replacement policies.
- Draw a B+-tree split and a merge, compute height from fan-out, and state the balance invariants.
- Discriminate clustered, secondary, and covering indexes and predict when an index-only scan is possible.
- Sketch an LSM-tree write path, a compaction step, and the role of a Bloom filter in read lookups.
- Match a workload to hash, bitmap, GIN, or GiST when a B+-tree is not the right shape.
- Distinguish volcano, vectorized, and JIT execution models and name where each pays off.
- Select between nested loop, hash, and sort-merge join and justify the choice against memory and input sizes.
- Describe 2PL, MVCC, and WAL precisely enough to explain a real
EXPLAINor crash-recovery trace.
Outputs
- a storage notebook with at least 10 I/O cost estimates for realistic queries
- one B+-tree worksheet with at least 5 inserts causing splits and 2 deletes causing merges, drawn by hand
- one LSM timeline diagram showing at least 4 flushes and 2 compactions, annotated with amplification
- one index-selection memo: 8 workloads, each with a chosen index type and a one-paragraph justification
- one
EXPLAINreading log: 6 real query plans broken down into operators and access paths - one join-algorithm sheet: for at least 5 scenarios, pick an algorithm and estimate its cost
- one concurrency scenarios log: 6 interleavings under 2PL and under MVCC, showing outcomes
- one WAL recovery walkthrough for a mid-transaction crash, redo and undo phases
- one mistake log naming at least 12 engine-reasoning errors (
confused logical and physical read,assumed index-only scan without covering,used nested loop on unordered large inputs,forgot checkpoint in WAL reasoning, etc.) - one short memo relating Module 2 habits to replication, transactions, and distributed systems ahead
Completion Standard
You have completed Module 2 when all of these are true:
- you can explain what a query touches physically, not only logically
- you can draw a B+-tree split and a merge without rereading
- you can sketch the LSM write path and compaction behavior and say what they cost
- you can pick an index for a new workload and defend it against at least one alternative
- you can read an
EXPLAINplan and name the access path, join method, and likely cost driver - you can explain the difference between a deadlock under 2PL and a conflict under snapshot isolation
- you can walk through crash recovery using WAL semantics
If the database "is fast enough for now" but you cannot explain why, the module is not complete.
Reading Policy
- Concept pages are the main path.
- Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuckmeans try the concept page, self-check, and drill first.Optional deep divemeans additional nuance or exercise volume, not required progression.- Because this is an implementation-depth module, written reasoning about pages, plans, and recovery is required, not optional enrichment.
Suggested Weekly Flow
| Day | Work |
|---|---|
| 1 | Concepts 1-3 and one page-layout diagram of your own |
| 2 | Concepts 4-5 and one hand-drawn B+-tree split sheet |
| 3 | Concept 6, index-selection memo for at least 4 workloads |
| 4 | Concepts 7-8, LSM compaction timeline drawing, toy memtable code |
| 5 | Concept 9 and Practice 1 + 2 |
| 6 | Concepts 10-11 and an EXPLAIN reading session on a real database |
| 7 | Concept 12, join-algorithm worksheet |
| 8 | Concepts 13-14 and concurrency scenarios log |
| 9 | Concept 15 and WAL recovery walkthrough; Practice 3 |
| 10 | Practice 4 (code katas), quiz, mistake-log cleanup |
Reference
If you need exact links into the local chunked books, use Reference and Selective Reading.
Use Database (KV) for disk-first storage and indexing, Redis Server for in-memory protocol/server behavior, and Search Engine for inverted indexes and ranking. These are three different storage shapes; do not collapse them into one project. See Build Your Own X overview.
Rich Learning Pages
Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread