Module 2: Storage Engines & Indexing

Primary text: Database Internals (Petrov)
Selective support: Designing Data-Intensive Applications Chapter 3 for architectural framing, Database System Concepts for classical rigor on indexes, query processing, concurrency, and recovery; Distributed Systems: Concepts and Design only when storage crosses machine boundaries

This guide is the primary teacher. You do not need to read the source books front-to-back to complete this module. You do need to become operationally strong at disk-aware storage reasoning, index structure choice, query execution cost, and the concurrency primitives that sit inside a single node before Module 3 takes the database across machines.

Scope of This Module

This module is not a catalog of storage engine brand names. It is where database performance stops being a magic box and becomes something you can reason about in pages, seeks, and access paths.

What it covers in depth:

why databases treat disk, SSD, and memory hierarchies differently, and how random vs sequential I/O dominates design
how pages, records, heap files, and slotted pages arrange bytes so the buffer pool can move them around
what the buffer pool actually does and why page replacement (LRU, clock, LRU-K) is not an afterthought
B+-tree structure, invariants, splits, merges, and why it wins for mixed OLTP workloads
point lookups, range scans, and the arithmetic of tree height vs fan-out
clustered vs secondary indexes, covering indexes, and index-only scans
LSM-trees: memtables, SSTables, compaction strategies, and read/write/space amplification tradeoffs
Bloom filters as LSM read accelerators and where they buy you speed vs false-positive cost
hash, bitmap, GIN, and GiST as specialty index structures and when they are justified
query execution models: volcano/iterator, vectorized, and JIT-compiled pipelines
join algorithms: nested loop, hash, sort-merge, and how the optimizer chooses between them
statistics, histograms, selectivity estimation, and what a cost-based optimizer actually does
lock-based concurrency control: 2PL, lock modes, escalation, deadlock
MVCC and snapshot isolation as the concurrency story modern engines actually tell
write-ahead logging (WAL) and the ARIES recovery model, including redo, undo, and checkpoints

What it deliberately does not try to finish here:

replication, partitioning, and cross-node placement (Module 3)
the full ACID/isolation spectrum and consistency theory (Module 4)
distributed transactions, consensus, and global time (Module 5)
hand-rolling a production storage engine

This is the implementation-depth foundation for the rest of Semester 6.

Before You Start

Answer these closed-book before starting the main path:

Why does a database care whether an I/O is random or sequential if both return the same bytes?
If a B+-tree has fan-out 250 and holds 10^9 keys, roughly how tall is it, and why does that matter for latency?
What is the difference between an LSM-tree and a B-tree in terms of where new writes go?
Why does a secondary index usually cost you an extra I/O compared to a clustered index?
What does "the database runs out of memory" actually mean if the buffer pool is full?

Diagnostic Interpretation

4-5 solid answers

You are ready for the full path.

2-3 solid answers

Continue, but expect extra time in the B-tree cluster and the LSM cluster.

0-1 solid answers

Revisit Module 1 (tables, rows, queries) and Semester 5 memory-hierarchy material before starting. Storage engine reasoning assumes both.

What This Module Is For

Storage engines are the layer where database claims meet physics. Later work repeatedly asks questions like:

will this index survive the write workload, or will it collapse into random I/O?
why is the 50th percentile latency 2 ms but the 99th is 300 ms?
can I afford a new secondary index on this column, or will it break inserts?
why does my transaction wait when nobody else seems to be reading or writing?
what happens on crash, and how much data do I lose?

This module builds the implementation reasoning needed for:

replication logs and partitioned placement (Module 3)
isolation levels and concurrency under real workloads (Module 4)
distributed storage under failure (Module 5)
performance tuning, capacity planning, and query-plan debugging in every system design course thereafter

You are learning to reason about persistence and indexing without handwaving.

Concept Map

How To Use This Module

Work in order. The later clusters only make sense if the earlier storage habits are stable.

Cluster 1: On-Disk Storage Foundations

Order	Concept	Type	Focus
1	Why Databases Care About Disk: Random vs Sequential I/O	PRIMARY	Latency hierarchy, block devices, and why access pattern dominates cost
2	Pages, Records, Heap Files, Slotted Pages	PRIMARY	The unit of transfer and how variable-length records are packed inside it
3	The Buffer Pool and Page Replacement	PRIMARY	Frames, pinning, dirty flags, and LRU/clock/LRU-K replacement

Cluster mastery check: Can you explain what physically happens on disk and in memory when a single SELECT returns one row from a cold database?

Cluster 2: B-Tree Indexes

Order	Concept	Type	Focus
4	B+-Tree Structure and Invariants	PRIMARY	Nodes, fan-out, balance invariants, splits, and merges
5	Point Lookups, Range Scans, and Why B+-Trees Dominate OLTP	PRIMARY	Height arithmetic, leaf-level linked lists, and mixed-read workloads
6	Clustered vs Secondary Indexes; Covering Indexes	PRIMARY	Row placement, index-only scans, and the extra-fetch problem

Cluster mastery check: Given a table, a query, and an index definition, can you trace which pages are read and justify the plan?

Cluster 3: LSM and Alternative Index Structures

Order	Concept	Type	Focus
7	LSM-Trees: SSTables, Memtables, Compaction	PRIMARY	Write-optimized design, leveled vs tiered compaction, amplification
8	Bloom Filters as LSM Read Accelerators	SUPPORTING	Probabilistic membership tests, false positives, and where they pay off
9	Hash, Bitmap, GIN/GiST: Specialty Indexes	SUPPORTING	When a B+-tree is the wrong shape for the workload

Cluster mastery check: Given a workload description (write-heavy KV, analytical scan, full-text search, geospatial, low-cardinality filter), can you pick an index structure and defend it?

Cluster 4: Query Execution and Planning

Order	Concept	Type	Focus
10	Query Execution Models: Volcano / Iterator, Vectorized, JIT	PRIMARY	Pulling tuples, batching for cache, and compiling to native code
11	Join Algorithms: Nested Loop, Hash, Sort-Merge	PRIMARY	Algorithm shape, memory budget, and when each wins
12	Statistics, Histograms, and the Cost-Based Optimizer	PRIMARY	Selectivity estimates, join order, and why bad stats kill performance

Cluster mastery check: Given an EXPLAIN plan for a three-way join, can you name the operators, name the access paths, and estimate where the work happens?

Cluster 5: Concurrency Control in Storage

Order	Concept	Type	Focus
13	Locking-Based Concurrency: 2PL, Lock Escalation	PRIMARY	Shared/exclusive modes, growing and shrinking phases, deadlock
14	MVCC: How Snapshots Avoid Reader-Writer Contention	PRIMARY	Versions, visibility rules, and snapshot isolation
15	Write-Ahead Logging (WAL) and Recovery with ARIES Sketch	PRIMARY	Log-before-data, redo, undo, checkpoints, and crash recovery

Cluster mastery check: Can you walk through what happens during and after a crash mid-transaction under a WAL+MVCC engine, and say which data survives?

Then work these practice pages:

Order	Practice path	Focus
1	Storage Engine Foundations Lab	I/O reasoning, page layout, buffer-pool thinking
2	Index Selection and Access Path Workshop	B+-tree vs LSM, clustered vs secondary, specialty indexes
3	Query Planning and Concurrency Clinic	Joins, cost estimation, locking, MVCC, recovery scenarios
4	Code Katas	B+-tree splits, toy LSM, index vs scan benchmarks, join cost walkthrough

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.

Learning Objectives

By the end of this module you should be able to:

Explain why random vs sequential I/O shapes engine design, and estimate the cost of a query in page reads.
Describe page-level storage (heap files, slotted pages, record headers) and show where a new row physically lands.
Explain what the buffer pool does, including pin counts, dirty pages, and at least two replacement policies.
Draw a B+-tree split and a merge, compute height from fan-out, and state the balance invariants.
Discriminate clustered, secondary, and covering indexes and predict when an index-only scan is possible.
Sketch an LSM-tree write path, a compaction step, and the role of a Bloom filter in read lookups.
Match a workload to hash, bitmap, GIN, or GiST when a B+-tree is not the right shape.
Distinguish volcano, vectorized, and JIT execution models and name where each pays off.
Select between nested loop, hash, and sort-merge join and justify the choice against memory and input sizes.
Describe 2PL, MVCC, and WAL precisely enough to explain a real EXPLAIN or crash-recovery trace.

Outputs

a storage notebook with at least 10 I/O cost estimates for realistic queries
one B+-tree worksheet with at least 5 inserts causing splits and 2 deletes causing merges, drawn by hand
one LSM timeline diagram showing at least 4 flushes and 2 compactions, annotated with amplification
one index-selection memo: 8 workloads, each with a chosen index type and a one-paragraph justification
one EXPLAIN reading log: 6 real query plans broken down into operators and access paths
one join-algorithm sheet: for at least 5 scenarios, pick an algorithm and estimate its cost
one concurrency scenarios log: 6 interleavings under 2PL and under MVCC, showing outcomes
one WAL recovery walkthrough for a mid-transaction crash, redo and undo phases
one mistake log naming at least 12 engine-reasoning errors (confused logical and physical read, assumed index-only scan without covering, used nested loop on unordered large inputs, forgot checkpoint in WAL reasoning, etc.)
one short memo relating Module 2 habits to replication, transactions, and distributed systems ahead

Completion Standard

You have completed Module 2 when all of these are true:

you can explain what a query touches physically, not only logically
you can draw a B+-tree split and a merge without rereading
you can sketch the LSM write path and compaction behavior and say what they cost
you can pick an index for a new workload and defend it against at least one alternative
you can read an EXPLAIN plan and name the access path, join method, and likely cost driver
you can explain the difference between a deadlock under 2PL and a conflict under snapshot isolation
you can walk through crash recovery using WAL semantics

If the database "is fast enough for now" but you cannot explain why, the module is not complete.

Reading Policy

Concept pages are the main path.
Local book chunks are selective reinforcement, not a second syllabus.
Read only if stuck means try the concept page, self-check, and drill first.
Optional deep dive means additional nuance or exercise volume, not required progression.
Because this is an implementation-depth module, written reasoning about pages, plans, and recovery is required, not optional enrichment.

Suggested Weekly Flow

Day	Work
1	Concepts 1-3 and one page-layout diagram of your own
2	Concepts 4-5 and one hand-drawn B+-tree split sheet
3	Concept 6, index-selection memo for at least 4 workloads
4	Concepts 7-8, LSM compaction timeline drawing, toy memtable code
5	Concept 9 and Practice 1 + 2
6	Concepts 10-11 and an `EXPLAIN` reading session on a real database
7	Concept 12, join-algorithm worksheet
8	Concepts 13-14 and concurrency scenarios log
9	Concept 15 and WAL recovery walkthrough; Practice 3
10	Practice 4 (code katas), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.

Build Your Own X - elective

Use Database (KV) for disk-first storage and indexing, Redis Server for in-memory protocol/server behavior, and Search Engine for inverted indexes and ranking. These are three different storage shapes; do not collapse them into one project. See Build Your Own X overview.

Rich Learning Pages

Scope of This Module​

Before You Start​

Diagnostic Interpretation​

What This Module Is For​

Concept Map​

How To Use This Module​

Cluster 1: On-Disk Storage Foundations​

Cluster 2: B-Tree Indexes​

Cluster 3: LSM and Alternative Index Structures​

Cluster 4: Query Execution and Planning​

Cluster 5: Concurrency Control in Storage​

Learning Objectives​

Outputs​

Completion Standard​

Reading Policy​

Suggested Weekly Flow​

Reference​

Rich Learning Pages​