Query Planning and Concurrency Clinic
Retrieval Prompts
- Describe the volcano, vectorized, and JIT execution models in one sentence each.
- State when hash join beats nested loop and when nested loop beats hash join.
- Explain the role of histograms in cost-based optimization.
- State the 2PL rule and why it guarantees serializability.
- Describe what an MVCC snapshot is and how visibility is decided.
- State the write-ahead logging rule in one sentence.
Compare and Distinguish
Separate these pairs clearly:
- volcano iterator vs vectorized execution
- nested loop vs hash join vs sort-merge join
- estimated vs actual row counts in a plan
- strict 2PL vs snapshot isolation
- WAL before-image vs after-image logging
- redo phase vs undo phase in ARIES
Common Mistake Check
For each statement, identify the error:
- "JIT always beats volcano execution."
- "Hash join is always faster than nested loop for large inputs."
- "Histograms are only needed for non-uniform distributions; otherwise NDV is enough."
- "Under MVCC, readers and writers never interact."
- "Snapshot isolation is serializable."
- "WAL doubles storage write cost; sequentiality makes it worse, not better."
EXPLAIN-Reading Drill
For each described plan, list (a) the access path chosen, (b) the join algorithm, (c) the likely cost driver, (d) one change you would consider:
Seq Scan on orders (rows=10^8) -> Hash Join (estimated 5 rows, actual 5*10^7)against a hash-built users table.Index Only Scan on idx_user_id (rows=200)returning exactly the projection columns.Nested Loopwith outerfilter users (rows=10^5)and innerIndex Scan on orders_user_id.Merge JoinwithSortchild on one side andIndex Scanon the other.Bitmap Heap Scanwith twoBitmap Index Scanchildren ANDed together.
Join-Algorithm Cost Walkthrough
Run this cost walkthrough for a three-way join:
SELECT o.id, u.name, p.title
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'DE' AND p.category = 'BOOKS';
Statistics:
orders:10^8rows, indexed by(user_id)and(product_id)users:10^6rows,sel(country='DE') = 0.05, PKidproducts:5 * 10^5rows,sel(category='BOOKS') = 0.1, PKid
- Estimate the size of each filtered input.
- For each of the two join orders
(users ⨝ orders) ⨝ productsand(products ⨝ orders) ⨝ users, pick join algorithms and justify. - Compute a rough cost for each plan in terms of page reads.
- Which plan wins, and what assumption is that conclusion most sensitive to?
Concurrency Scenarios Log
For each interleaving, state the outcome under (a) strict 2PL and (b) snapshot isolation. If the two disagree, explain which anomaly is admitted or avoided.
T1: r(A), w(A)andT2: r(A), w(A)overlapping.T1: r(A), w(B)andT2: r(B), w(A)overlapping.T1: r(rows WHERE status='OPEN'), w(new row with status='OPEN')alongsideT2: r(same predicate).- Long read
T1runs for20 minwhile manyT2..T100write. T1: BEGIN; UPDATE A; UPDATE B; COMMITandT2: BEGIN; UPDATE B; UPDATE A; COMMITstarted in parallel.
WAL Recovery Walkthrough
Given the log tail:
100 CHECKPOINT active={T1} dirty={A:recLSN=98}
101 T1 UPDATE A before=0 after=1
102 T2 BEGIN
103 T2 UPDATE B before=0 after=5
104 T1 COMMIT
105 T3 BEGIN
106 T3 UPDATE C before=0 after=9
CRASH
- Which transactions are winners? Losers?
- Which pages are in the dirty-page table after Analysis?
- From which LSN does Redo begin? What records does it reapply?
- In what order are loser transactions undone? What CLRs are written?
- If a second crash happens during step 4, describe exactly how the next restart proceeds.
Evidence Check
This page is complete only if you can read a real EXPLAIN plan, spot the cost driver, and walk through a crash recovery scenario in sentences -- not in formulas.