Skip to main content

Index Selection and Access Path Workshop

Retrieval Prompts

  1. State the B+-tree balance invariant and which operation is the only one that grows the tree in height.
  2. Give the height of a B+-tree as a function of fan-out and leaf count.
  3. State the rule for when an index-only scan is possible.
  4. Sketch the LSM-tree write path: where a write lands and when it reaches disk.
  5. State when a Bloom filter returns No and what that means for an LSM read.

Compare and Distinguish

Separate these pairs clearly:

  • B+-tree vs LSM-tree under heavy random-key writes
  • clustered vs secondary index lookup cost
  • covering index vs index with INCLUDE columns
  • hash index vs B+-tree for equality
  • GIN vs GiST vs B+-tree for non-scalar column types

Common Mistake Check

For each statement, identify the error:

  1. "A secondary index always reduces I/O."
  2. "An LSM-tree has no random I/O."
  3. "A Bloom filter can give false negatives."
  4. "All indexes help range scans."
  5. "Adding more indexes only costs disk space, not write performance."
  6. "Bitmap indexes are best for high-cardinality columns."

Index Selection Drill

For each workload, select an index type and justify in one paragraph with specific costs:

  1. A clickstream table gets 10^5 inserts/s, with reads mostly scanning (user_id, ts) in time order.
  2. A users table is read by email for login and never scanned.
  3. A documents table stores jsonb payloads; filters often check data @> '{"tag": "..."}'.
  4. An events table has columns status (4 values), severity (3 values), region (15 values). Queries are WHERE combinations of any two.
  5. A places table with (lat, lon) coordinates; queries find nearest neighbors.
  6. A transactions PK-clustered table with random UUIDv4 keys experiences insert-throughput collapse as it grows.

B+-Tree Split Drill

Using an order-4 B+-tree (max 3 entries per leaf), insert the following in order. Draw the tree (ASCII) after every operation that causes a split or a merge:

10, 20, 30, 40, 5, 15, 25, 35, 45, 50, 55, 60, 65, 70

Then delete in order: 10, 20, 30, 40, 50. Draw after every underflow.

At the end, state the final tree height and the number of leaves.

LSM Compaction Worksheet

You have memtable size 32 MB, L0 compact-trigger 4 SSTables, L1 target 256 MB, 10x growth per level, sustained write rate 5 MB/s, average key-value size 512 B, 1% Bloom FPR per SSTable.

  1. How often does the memtable flush?
  2. How many SSTables accumulate in L0 before compaction?
  3. How large will L2 be when the database holds ~50 GB of data?
  4. Expected number of random reads for a GET of an absent key, with Bloom filters, assuming 40 SSTables exist.
  5. If the same workload used a B+-tree instead, what changes in the write-side cost?

Evidence Check

This page is complete only if you can look at a new workload description and defend a choice of index structure in writing, with numbers, against at least one reasonable alternative.

Document Model Contrast

Model the restaurant reservation workflow twice: once relationally and once as documents.

For the document version:

  1. Design collections for guests, reservations, orders, and menu items.
  2. Decide what to embed and what to reference.
  3. Write CRUD examples for creating a reservation, adding an order item, changing payment status, and finding today's reservations.
  4. Add a basic authorization note: which roles may read or mutate each collection.
  5. Identify one query filter that needs an index and one access pattern that would be worse than the relational version.

Decision prompt: write a short relational-vs-document tradeoff note. Include consistency boundary, query shape, update frequency, and reporting needs.