Skip to main content

Query Planning and Concurrency Clinic

Retrieval Prompts

  1. Describe the volcano, vectorized, and JIT execution models in one sentence each.
  2. State when hash join beats nested loop and when nested loop beats hash join.
  3. Explain the role of histograms in cost-based optimization.
  4. State the 2PL rule and why it guarantees serializability.
  5. Describe what an MVCC snapshot is and how visibility is decided.
  6. State the write-ahead logging rule in one sentence.

Compare and Distinguish

Separate these pairs clearly:

  • volcano iterator vs vectorized execution
  • nested loop vs hash join vs sort-merge join
  • estimated vs actual row counts in a plan
  • strict 2PL vs snapshot isolation
  • WAL before-image vs after-image logging
  • redo phase vs undo phase in ARIES

Common Mistake Check

For each statement, identify the error:

  1. "JIT always beats volcano execution."
  2. "Hash join is always faster than nested loop for large inputs."
  3. "Histograms are only needed for non-uniform distributions; otherwise NDV is enough."
  4. "Under MVCC, readers and writers never interact."
  5. "Snapshot isolation is serializable."
  6. "WAL doubles storage write cost; sequentiality makes it worse, not better."

EXPLAIN-Reading Drill

For each described plan, list (a) the access path chosen, (b) the join algorithm, (c) the likely cost driver, (d) one change you would consider:

  1. Seq Scan on orders (rows=10^8) -> Hash Join (estimated 5 rows, actual 5*10^7) against a hash-built users table.
  2. Index Only Scan on idx_user_id (rows=200) returning exactly the projection columns.
  3. Nested Loop with outer filter users (rows=10^5) and inner Index Scan on orders_user_id.
  4. Merge Join with Sort child on one side and Index Scan on the other.
  5. Bitmap Heap Scan with two Bitmap Index Scan children ANDed together.

Join-Algorithm Cost Walkthrough

Run this cost walkthrough for a three-way join:

SELECT o.id, u.name, p.title
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'DE' AND p.category = 'BOOKS';

Statistics:

  • orders: 10^8 rows, indexed by (user_id) and (product_id)
  • users: 10^6 rows, sel(country='DE') = 0.05, PK id
  • products: 5 * 10^5 rows, sel(category='BOOKS') = 0.1, PK id
  1. Estimate the size of each filtered input.
  2. For each of the two join orders (users ⨝ orders) ⨝ products and (products ⨝ orders) ⨝ users, pick join algorithms and justify.
  3. Compute a rough cost for each plan in terms of page reads.
  4. Which plan wins, and what assumption is that conclusion most sensitive to?

Concurrency Scenarios Log

For each interleaving, state the outcome under (a) strict 2PL and (b) snapshot isolation. If the two disagree, explain which anomaly is admitted or avoided.

  1. T1: r(A), w(A) and T2: r(A), w(A) overlapping.
  2. T1: r(A), w(B) and T2: r(B), w(A) overlapping.
  3. T1: r(rows WHERE status='OPEN'), w(new row with status='OPEN') alongside T2: r(same predicate).
  4. Long read T1 runs for 20 min while many T2..T100 write.
  5. T1: BEGIN; UPDATE A; UPDATE B; COMMIT and T2: BEGIN; UPDATE B; UPDATE A; COMMIT started in parallel.

WAL Recovery Walkthrough

Given the log tail:

100  CHECKPOINT active={T1} dirty={A:recLSN=98}
101 T1 UPDATE A before=0 after=1
102 T2 BEGIN
103 T2 UPDATE B before=0 after=5
104 T1 COMMIT
105 T3 BEGIN
106 T3 UPDATE C before=0 after=9
CRASH
  1. Which transactions are winners? Losers?
  2. Which pages are in the dirty-page table after Analysis?
  3. From which LSN does Redo begin? What records does it reapply?
  4. In what order are loser transactions undone? What CLRs are written?
  5. If a second crash happens during step 4, describe exactly how the next restart proceeds.

Evidence Check

This page is complete only if you can read a real EXPLAIN plan, spot the cost driver, and walk through a crash recovery scenario in sentences -- not in formulas.