Skip to main content

Exercises

Use these after the concept pages and practice path. Every exercise has a written deliverable -- the point is not solving it once but building a reusable artifact you can return to.

Section 1: Storage Foundations (Cluster 1)

Exercise 1.1: Cost Sheet

Pick five realistic SQL queries from any application you know. For each, estimate, on paper:

  • number of logical page reads (assume 8 KB pages)
  • number of random vs sequential reads
  • whether the buffer pool is likely to help a repeat query

Then run them against a real database with EXPLAIN (ANALYZE, BUFFERS) and reconcile the difference with your estimate.

Deliverable: one-page table per query with estimated vs observed reads and a single sentence explaining any miss.

Exercise 1.2: Slotted Page Drawing

Draw, by hand:

  1. An empty 8 KB page with header.
  2. The same page after inserting 5 records of lengths 120, 80, 200, 60, 150 bytes.
  3. The page after deleting record 3 and inserting one of length 220 bytes.

Show the slot array direction, the free-space boundary, and every record's offset.

Deliverable: one sketch.

Exercise 1.3: LRU vs Clock Instrument

Pick any open-source cache library. Instrument it to log hits, misses, and evictions. Run a synthetic workload that mimics a repeated scan and a working set. Explain the resulting miss curve.

Deliverable: one graph (hit rate vs working-set size) plus a 200-word analysis.

Section 2: B-Tree Indexes (Cluster 2)

Exercise 2.1: Height Estimates

For each scenario, compute the B+-tree height:

  1. Fan-out 50, 10^7 keys.
  2. Fan-out 100, 10^9 keys.
  3. Fan-out 500, 10^10 keys.
  4. Fan-out 32, 10^6 keys.

State, for each, how many cold-cache page reads a point lookup costs (clustered) and how many for a secondary index.

Exercise 2.2: Access Path Design

Given a products table (id PK, sku, category, price, created_at) with these workloads:

  1. WHERE sku = ? (millions of unique SKUs, equality)
  2. WHERE category = ? ORDER BY price DESC LIMIT 20
  3. WHERE created_at > now() - interval '7 days' ORDER BY created_at

Design one index per workload and justify each. Then redesign with a constraint: only two indexes total are allowed. Which workload degrades and by how much?

Exercise 2.3: Split Trace

Starting from an empty order-5 B+-tree (4 keys max per node), insert the integers 1..20 in order, then in reverse order. Draw the final tree for each sequence and explain why the resulting shapes differ.

Section 3: LSM and Specialty Indexes (Cluster 3)

Exercise 3.1: Amplification Worksheet

Given write rate 10 MB/s, memtable 64 MB, leveled compaction with 10x growth per level, and a steady-state total data size of 200 GB:

  1. How many levels exist?
  2. What is the approximate write amplification?
  3. If you switch to size-tiered compaction, how do read amplification and space amplification change?

Exercise 3.2: Bloom Budget

You have a 5% memory budget for Bloom filters across 100 SSTables containing 10^8 keys total. Compute the resulting false-positive rate per SSTable and the expected number of disk reads for a negative lookup across all SSTables.

Exercise 3.3: Specialty Index Survey

For each of the following, identify the most appropriate index type and write one paragraph justifying:

  • IP address range lookups
  • Full-text search over user-generated content
  • Geographic point-in-polygon
  • Filtering analytic events by (region, event_type) with 10 and 20 distinct values
  • An append-only time-series table with billion-row range scans by timestamp

Section 4: Query Execution and Planning (Cluster 4)

Exercise 4.1: EXPLAIN Reading Log

Pick any production or sample database. Record and annotate at least 10 EXPLAIN (ANALYZE, BUFFERS) outputs covering:

  • one seq scan that was correctly chosen
  • one seq scan that was incorrectly chosen
  • one index scan that avoided a heap fetch
  • one hash join
  • one nested loop join
  • one sort-merge join
  • one plan with an obvious statistics failure (estimate vs actual off by at least 10x)
  • one plan before and after ANALYZE
  • one plan that switched when a new index was added
  • one plan with a lateral or correlated subquery

Exercise 4.2: Join Order Experiment

Write a three-way join query, force the planner (with hints or set enable_mergejoin, set enable_hashjoin) to use each of three join algorithms, and compare actual execution time. Explain the gap.

Exercise 4.3: Selectivity Estimation

Pick one column with a skewed distribution in your database. Produce:

  • the default histogram Postgres generates
  • the selectivity estimate for a specific predicate
  • the actual count
  • a proposal for fixing a mis-estimate (extended statistics, histogram resolution, functional index, etc.)

Section 5: Concurrency and Recovery (Cluster 5)

Exercise 5.1: Lock Scenario Matrix

For each of the following interleavings, state what strict 2PL does, what snapshot isolation does, and whether each is serializable:

  1. T1: r(A), w(A) vs T2: r(A), w(A)
  2. T1: r(A), w(B) vs T2: r(B), w(A)
  3. T1: w(A), r(B) vs T2: w(B), r(A)
  4. T1: r(range X), w(new row in X) vs T2: r(range X)

Exercise 5.2: MVCC Bloat Observation

In Postgres, repeatedly UPDATE a single row many times inside a long-running read transaction. Before committing the reader, measure pg_relation_size for the table. Commit the reader, run VACUUM, and measure again. Write one paragraph explaining exactly what you observed in terms of visibility and garbage collection.

Exercise 5.3: WAL Walkthrough

Produce a written walkthrough of a crash recovery for a WAL tail of at least 10 records including BEGIN, UPDATE, COMMIT, and a mid-transaction crash. Mark Analysis, Redo, and Undo phases and name every CLR written.

Exercise 5.4: Crash-Test on SQLite

Use SQLite's synchronous modes. Run a single transaction that writes many rows. Kill the process mid-transaction with and without synchronous=FULL. Document what you observe on restart.

Completion Standard

  • Every section has at least one written artifact
  • The EXPLAIN reading log contains at least 10 plans
  • All numeric exercises show work, not just answers
  • Mistakes and surprises are logged in a separate running document