Statistics, Histograms, and the Cost-Based Optimizer
This generated surface maps a learner-facing curriculum unit to its canonical source routes.
Curriculum surface
- Open learner-facing unit
- Curriculum path:
content/curriculum/architecture/semester-06-databases-distributed/module-02-storage-engines-indexing/concepts/cluster-04-query-execution-and-planning/12-statistics-histograms-and-the-cost-based-optimizer-primary.md - App:
architecture - Semester:
semester-06-databases-distributed - Module:
module-02-storage-engines-indexing - Unit kind:
concept - Curation level:
module_curated
Learning objectives
- Explain Statistics, Histograms, and the Cost-Based Optimizer in terms of physical layout, access paths, and performance tradeoffs instead of memorizing structure names.
- Relate Statistics, Histograms, and the Cost-Based Optimizer to the actual cost of reads, writes, buffering, and maintenance work inside a storage engine.
- Use
database-system-conceptsto connect the learner explanation to B-trees, LSM behavior, query execution, and recovery mechanics.
Prerequisites
- Comfort with the relational model, SQL querying, and basic database terminology from module 01.
Source books
database-system-concepts
Source routes
Database System Concepts
- /books/database-system-concepts via
Database System Concepts: 16.2 Transformation of Relational Expressions (Part 1),Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 1),Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 2),Database System Concepts: 16.4 Choice of Evaluation Plans (Part 1) - /books/database-system-concepts/chapter-26-blockchain-databases via
25..32), assumes uniform within it. That bucket spans7integer values and holds100,000rows, so the estimate is~100,000 / 7 ~= 14,286rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).
If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.
Concrete Example -- Join Cost Walkthrough
SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.
Statistics:
users:|users| = 10^6;sel(country='DE') = 0.05; estimated50,000usersrows after filter.orders:|orders| = 10^8;NDV(user_id) = 10^6; assumed uniform,~100orders per user.
Two plans:
Plan A -- hash join, build on filter(users):
- filter
users->50,000rows, one sequential scan: ~O(|users|) - build hash table:
50,000entries - probe with
orders, sequential scan of10^8, for each row lookup in hash table - output size estimate:
50,000 users * 100 orders/user = 5 * 10^6rows - cost ≈
scan(users) + scan(orders) + hash(50K) + probe(10^8)-> dominated by theordersscan
Plan B -- nested loop, filter users outer, index on orders(user_id):
- filter
users->50,000rows - for each filtered user, index lookup in
ordersreturning~100rows each 50,000 * 100 = 5 * 10^6index-nested-loop fetches
Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.
The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches -- catastrophic.
Common Confusion / Misconception
"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.
"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.
"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.
How To Use It
When a plan looks wrong:
- Run
EXPLAIN (ANALYZE, BUFFERS)and compareestimated rowsvsactual rowsat every node. - Where they disagree by
>10x, that is the source of the bad plan. - Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
- Only after stats are healthy, consider query rewrites or index changes.
- Remember: a "fast enough" plan on bad statistics may silently regress when data grows.
Check Yourself
- Why does a cost-based optimizer need selectivity estimates rather than exact counts?
- Why are equi-depth histograms usually preferred over equi-width?
- How does the independence assumption fail for correlated columns, and what can be done?
Mini Drill or Application
Given a table with 10^7 rows and these statistics:
country:NDV = 200, histogram tells yousel(country='DE') = 0.06status:NDV = 5,sel(status='ACTIVE') = 0.4countryandstatusare correlated: realsel(country='DE' AND status='ACTIVE') = 0.01
- Compute the optimizer's independence-assumption estimate for the combined predicate.
- Compute the true count and the relative error.
- Describe one plan difference that could result from this error.
- Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.
Read This Only If Stuck
- [Database System Concepts: 16.1 Overview
,25..32), assumes uniform within it. That bucket spans7integer values and holds100,000rows, so the estimate is~100,000 / 7 ~= 14,286` rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).
If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.
Concrete Example — Join Cost Walkthrough
SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.
Statistics:
users:|users| = 10^6;sel(country='DE') = 0.05; estimated50,000usersrows after filter.orders:|orders| = 10^8;NDV(user_id) = 10^6; assumed uniform,~100orders per user.
Two plans:
Plan A — hash join, build on filter(users):
- filter
users->50,000rows, one sequential scan: ~O(|users|) - build hash table:
50,000entries - probe with
orders, sequential scan of10^8, for each row lookup in hash table - output size estimate:
50,000 users * 100 orders/user = 5 * 10^6rows - cost ≈
scan(users) + scan(orders) + hash(50K) + probe(10^8)→ dominated by theordersscan
Plan B — nested loop, filter users outer, index on orders(user_id):
- filter
users->50,000rows - for each filtered user, index lookup in
ordersreturning~100rows each 50,000 * 100 = 5 * 10^6index-nested-loop fetches
Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.
The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches — catastrophic.
Common Confusion / Misconception
"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.
"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.
"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.
How To Use It
When a plan looks wrong:
- Run
EXPLAIN (ANALYZE, BUFFERS)and compareestimated rowsvsactual rowsat every node. - Where they disagree by
>10x, that is the source of the bad plan. - Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
- Only after stats are healthy, consider query rewrites or index changes.
- Remember: a "fast enough" plan on bad statistics may silently regress when data grows.
Check Yourself
- Why does a cost-based optimizer need selectivity estimates rather than exact counts?
- Why are equi-depth histograms usually preferred over equi-width?
- How does the independence assumption fail for correlated columns, and what can be done?
Mini Drill or Application
Given a table with 10^7 rows and these statistics:
country:NDV = 200, histogram tells yousel(country='DE') = 0.06status:NDV = 5,sel(status='ACTIVE') = 0.4countryandstatusare correlated: realsel(country='DE' AND status='ACTIVE') = 0.01
- Compute the optimizer's independence-assumption estimate for the combined predicate.
- Compute the true count and the relative error.
- Describe one plan difference that could result from this error.
- Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.
Read This Only If Stuck
- [Database System Concepts: 16.1 Overview`
Supporting curriculum routes
No supporting curriculum routes linked yet.
External enrichment
- PostgreSQL Documentation: Indexes (
official_docs_companion) - Grounds index and execution-planning ideas in a real production database that learners can actually inspect. - PostgreSQL Documentation: Using EXPLAIN (
official_docs_companion) - Useful when the learner needs to connect storage-engine ideas to actual execution plans and cost reasoning.
AI companion modes
- Explain simply
- Socratic tutor
- Challenge my understanding
- Diagnose my confusion
- Connect forward / backward
Source-of-truth note
This teaching unit is learner-facing guidance. Its canonical source backbone is the referenced book database-system-concepts, and outside material should only clarify or strengthen that backbone.