Skip to main content

Module 2: Storage Engines & Indexing

Primary text: Database Internals (Petrov)
Selective support: Designing Data-Intensive Applications Chapter 3 for architectural framing, Database System Concepts for classical rigor on indexes, query processing, concurrency, and recovery; Distributed Systems: Concepts and Design only when storage crosses machine boundaries

This guide is the primary teacher. You do not need to read the source books front-to-back to complete this module. You do need to become operationally strong at disk-aware storage reasoning, index structure choice, query execution cost, and the concurrency primitives that sit inside a single node before Module 3 takes the database across machines.


Scope of This Module

This module is not a catalog of storage engine brand names. It is where database performance stops being a magic box and becomes something you can reason about in pages, seeks, and access paths.

What it covers in depth:

  • why databases treat disk, SSD, and memory hierarchies differently, and how random vs sequential I/O dominates design
  • how pages, records, heap files, and slotted pages arrange bytes so the buffer pool can move them around
  • what the buffer pool actually does and why page replacement (LRU, clock, LRU-K) is not an afterthought
  • B+-tree structure, invariants, splits, merges, and why it wins for mixed OLTP workloads
  • point lookups, range scans, and the arithmetic of tree height vs fan-out
  • clustered vs secondary indexes, covering indexes, and index-only scans
  • LSM-trees: memtables, SSTables, compaction strategies, and read/write/space amplification tradeoffs
  • Bloom filters as LSM read accelerators and where they buy you speed vs false-positive cost
  • hash, bitmap, GIN, and GiST as specialty index structures and when they are justified
  • query execution models: volcano/iterator, vectorized, and JIT-compiled pipelines
  • join algorithms: nested loop, hash, sort-merge, and how the optimizer chooses between them
  • statistics, histograms, selectivity estimation, and what a cost-based optimizer actually does
  • lock-based concurrency control: 2PL, lock modes, escalation, deadlock
  • MVCC and snapshot isolation as the concurrency story modern engines actually tell
  • write-ahead logging (WAL) and the ARIES recovery model, including redo, undo, and checkpoints

What it deliberately does not try to finish here:

  • replication, partitioning, and cross-node placement (Module 3)
  • the full ACID/isolation spectrum and consistency theory (Module 4)
  • distributed transactions, consensus, and global time (Module 5)
  • hand-rolling a production storage engine

This is the implementation-depth foundation for the rest of Semester 6.


Before You Start

Answer these closed-book before starting the main path:

  1. Why does a database care whether an I/O is random or sequential if both return the same bytes?
  2. If a B+-tree has fan-out 250 and holds 10^9 keys, roughly how tall is it, and why does that matter for latency?
  3. What is the difference between an LSM-tree and a B-tree in terms of where new writes go?
  4. Why does a secondary index usually cost you an extra I/O compared to a clustered index?
  5. What does "the database runs out of memory" actually mean if the buffer pool is full?

Diagnostic Interpretation

4-5 solid answers

  • You are ready for the full path.

2-3 solid answers

  • Continue, but expect extra time in the B-tree cluster and the LSM cluster.

0-1 solid answers

  • Revisit Module 1 (tables, rows, queries) and Semester 5 memory-hierarchy material before starting. Storage engine reasoning assumes both.

What This Module Is For

Storage engines are the layer where database claims meet physics. Later work repeatedly asks questions like:

  • will this index survive the write workload, or will it collapse into random I/O?
  • why is the 50th percentile latency 2 ms but the 99th is 300 ms?
  • can I afford a new secondary index on this column, or will it break inserts?
  • why does my transaction wait when nobody else seems to be reading or writing?
  • what happens on crash, and how much data do I lose?

This module builds the implementation reasoning needed for:

  • replication logs and partitioned placement (Module 3)
  • isolation levels and concurrency under real workloads (Module 4)
  • distributed storage under failure (Module 5)
  • performance tuning, capacity planning, and query-plan debugging in every system design course thereafter

You are learning to reason about persistence and indexing without handwaving.


Concept Map


How To Use This Module

Work in order. The later clusters only make sense if the earlier storage habits are stable.

Cluster 1: On-Disk Storage Foundations

OrderConceptTypeFocus
1Why Databases Care About Disk: Random vs Sequential I/OPRIMARYLatency hierarchy, block devices, and why access pattern dominates cost
2Pages, Records, Heap Files, Slotted PagesPRIMARYThe unit of transfer and how variable-length records are packed inside it
3The Buffer Pool and Page ReplacementPRIMARYFrames, pinning, dirty flags, and LRU/clock/LRU-K replacement

Cluster mastery check: Can you explain what physically happens on disk and in memory when a single SELECT returns one row from a cold database?

Cluster 2: B-Tree Indexes

OrderConceptTypeFocus
4B+-Tree Structure and InvariantsPRIMARYNodes, fan-out, balance invariants, splits, and merges
5Point Lookups, Range Scans, and Why B+-Trees Dominate OLTPPRIMARYHeight arithmetic, leaf-level linked lists, and mixed-read workloads
6Clustered vs Secondary Indexes; Covering IndexesPRIMARYRow placement, index-only scans, and the extra-fetch problem

Cluster mastery check: Given a table, a query, and an index definition, can you trace which pages are read and justify the plan?

Cluster 3: LSM and Alternative Index Structures

OrderConceptTypeFocus
7LSM-Trees: SSTables, Memtables, CompactionPRIMARYWrite-optimized design, leveled vs tiered compaction, amplification
8Bloom Filters as LSM Read AcceleratorsSUPPORTINGProbabilistic membership tests, false positives, and where they pay off
9Hash, Bitmap, GIN/GiST: Specialty IndexesSUPPORTINGWhen a B+-tree is the wrong shape for the workload

Cluster mastery check: Given a workload description (write-heavy KV, analytical scan, full-text search, geospatial, low-cardinality filter), can you pick an index structure and defend it?

Cluster 4: Query Execution and Planning

OrderConceptTypeFocus
10Query Execution Models: Volcano / Iterator, Vectorized, JITPRIMARYPulling tuples, batching for cache, and compiling to native code
11Join Algorithms: Nested Loop, Hash, Sort-MergePRIMARYAlgorithm shape, memory budget, and when each wins
12Statistics, Histograms, and the Cost-Based OptimizerPRIMARYSelectivity estimates, join order, and why bad stats kill performance

Cluster mastery check: Given an EXPLAIN plan for a three-way join, can you name the operators, name the access paths, and estimate where the work happens?

Cluster 5: Concurrency Control in Storage

OrderConceptTypeFocus
13Locking-Based Concurrency: 2PL, Lock EscalationPRIMARYShared/exclusive modes, growing and shrinking phases, deadlock
14MVCC: How Snapshots Avoid Reader-Writer ContentionPRIMARYVersions, visibility rules, and snapshot isolation
15Write-Ahead Logging (WAL) and Recovery with ARIES SketchPRIMARYLog-before-data, redo, undo, checkpoints, and crash recovery

Cluster mastery check: Can you walk through what happens during and after a crash mid-transaction under a WAL+MVCC engine, and say which data survives?

Then work these practice pages:

OrderPractice pathFocus
1Storage Engine Foundations LabI/O reasoning, page layout, buffer-pool thinking
2Index Selection and Access Path WorkshopB+-tree vs LSM, clustered vs secondary, specialty indexes
3Query Planning and Concurrency ClinicJoins, cost estimation, locking, MVCC, recovery scenarios
4Code KatasB+-tree splits, toy LSM, index vs scan benchmarks, join cost walkthrough

Use Module Quiz after the concept and practice path. Use Reference and Selective Reading and Learning Resources only for targeted reinforcement.


Learning Objectives

By the end of this module you should be able to:

  1. Explain why random vs sequential I/O shapes engine design, and estimate the cost of a query in page reads.
  2. Describe page-level storage (heap files, slotted pages, record headers) and show where a new row physically lands.
  3. Explain what the buffer pool does, including pin counts, dirty pages, and at least two replacement policies.
  4. Draw a B+-tree split and a merge, compute height from fan-out, and state the balance invariants.
  5. Discriminate clustered, secondary, and covering indexes and predict when an index-only scan is possible.
  6. Sketch an LSM-tree write path, a compaction step, and the role of a Bloom filter in read lookups.
  7. Match a workload to hash, bitmap, GIN, or GiST when a B+-tree is not the right shape.
  8. Distinguish volcano, vectorized, and JIT execution models and name where each pays off.
  9. Select between nested loop, hash, and sort-merge join and justify the choice against memory and input sizes.
  10. Describe 2PL, MVCC, and WAL precisely enough to explain a real EXPLAIN or crash-recovery trace.

Outputs

  • a storage notebook with at least 10 I/O cost estimates for realistic queries
  • one B+-tree worksheet with at least 5 inserts causing splits and 2 deletes causing merges, drawn by hand
  • one LSM timeline diagram showing at least 4 flushes and 2 compactions, annotated with amplification
  • one index-selection memo: 8 workloads, each with a chosen index type and a one-paragraph justification
  • one EXPLAIN reading log: 6 real query plans broken down into operators and access paths
  • one join-algorithm sheet: for at least 5 scenarios, pick an algorithm and estimate its cost
  • one concurrency scenarios log: 6 interleavings under 2PL and under MVCC, showing outcomes
  • one WAL recovery walkthrough for a mid-transaction crash, redo and undo phases
  • one mistake log naming at least 12 engine-reasoning errors (confused logical and physical read, assumed index-only scan without covering, used nested loop on unordered large inputs, forgot checkpoint in WAL reasoning, etc.)
  • one short memo relating Module 2 habits to replication, transactions, and distributed systems ahead

Completion Standard

You have completed Module 2 when all of these are true:

  • you can explain what a query touches physically, not only logically
  • you can draw a B+-tree split and a merge without rereading
  • you can sketch the LSM write path and compaction behavior and say what they cost
  • you can pick an index for a new workload and defend it against at least one alternative
  • you can read an EXPLAIN plan and name the access path, join method, and likely cost driver
  • you can explain the difference between a deadlock under 2PL and a conflict under snapshot isolation
  • you can walk through crash recovery using WAL semantics

If the database "is fast enough for now" but you cannot explain why, the module is not complete.


Reading Policy

  • Concept pages are the main path.
  • Local book chunks are selective reinforcement, not a second syllabus.
  • Read only if stuck means try the concept page, self-check, and drill first.
  • Optional deep dive means additional nuance or exercise volume, not required progression.
  • Because this is an implementation-depth module, written reasoning about pages, plans, and recovery is required, not optional enrichment.

Suggested Weekly Flow

DayWork
1Concepts 1-3 and one page-layout diagram of your own
2Concepts 4-5 and one hand-drawn B+-tree split sheet
3Concept 6, index-selection memo for at least 4 workloads
4Concepts 7-8, LSM compaction timeline drawing, toy memtable code
5Concept 9 and Practice 1 + 2
6Concepts 10-11 and an EXPLAIN reading session on a real database
7Concept 12, join-algorithm worksheet
8Concepts 13-14 and concurrency scenarios log
9Concept 15 and WAL recovery walkthrough; Practice 3
10Practice 4 (code katas), quiz, mistake-log cleanup

Reference

If you need exact links into the local chunked books, use Reference and Selective Reading.


Build Your Own X - elective

Use Database (KV) for disk-first storage and indexing, Redis Server for in-memory protocol/server behavior, and Search Engine for inverted indexes and ranking. These are three different storage shapes; do not collapse them into one project. See Build Your Own X overview.

Rich Learning Pages

Worked Examples | Guided Labs | Case Studies | Mistake Clinic | Reading Guide | Capstone Thread