Statistics, Histograms, and the Cost-Based Optimizer

This generated surface maps a learner-facing curriculum unit to its canonical source routes.

Curriculum surface

Open learner-facing unit
Curriculum path: content/curriculum/architecture/semester-06-databases-distributed/module-02-storage-engines-indexing/concepts/cluster-04-query-execution-and-planning/12-statistics-histograms-and-the-cost-based-optimizer-primary.md
App: architecture
Semester: semester-06-databases-distributed
Module: module-02-storage-engines-indexing
Unit kind: concept
Curation level: module_curated

Learning objectives

Explain Statistics, Histograms, and the Cost-Based Optimizer in terms of physical layout, access paths, and performance tradeoffs instead of memorizing structure names.
Relate Statistics, Histograms, and the Cost-Based Optimizer to the actual cost of reads, writes, buffering, and maintenance work inside a storage engine.
Use database-system-concepts to connect the learner explanation to B-trees, LSM behavior, query execution, and recovery mechanics.

Prerequisites

Comfort with the relational model, SQL querying, and basic database terminology from module 01.

Source books

database-system-concepts

Source routes

Database System Concepts

/books/database-system-concepts via Database System Concepts: 16.2 Transformation of Relational Expressions (Part 1), Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 1), Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 2), Database System Concepts: 16.4 Choice of Evaluation Plans (Part 1)
/books/database-system-concepts/chapter-26-blockchain-databases via 25..32), assumes uniform within it. That bucket spans 7 integer values and holds 100,000 rows, so the estimate is ~100,000 / 7 ~= 14,286 rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).

If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.

Concrete Example -- Join Cost Walkthrough

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.

Statistics:

users: |users| = 10^6; sel(country='DE') = 0.05; estimated 50,000 users rows after filter.
orders: |orders| = 10^8; NDV(user_id) = 10^6; assumed uniform, ~100 orders per user.

Two plans:

Plan A -- hash join, build on filter(users):

filter users -> 50,000 rows, one sequential scan: ~O(|users|)
build hash table: 50,000 entries
probe with orders, sequential scan of 10^8, for each row lookup in hash table
output size estimate: 50,000 users * 100 orders/user = 5 * 10^6 rows
cost ≈ scan(users) + scan(orders) + hash(50K) + probe(10^8) -> dominated by the orders scan

Plan B -- nested loop, filter users outer, index on orders(user_id):

filter users -> 50,000 rows
for each filtered user, index lookup in orders returning ~100 rows each
50,000 * 100 = 5 * 10^6 index-nested-loop fetches

Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.

The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches -- catastrophic.

Common Confusion / Misconception

"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.

"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.

"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.

How To Use It

When a plan looks wrong:

Run EXPLAIN (ANALYZE, BUFFERS) and compare estimated rows vs actual rows at every node.
Where they disagree by >10x, that is the source of the bad plan.
Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
Only after stats are healthy, consider query rewrites or index changes.
Remember: a "fast enough" plan on bad statistics may silently regress when data grows.

Check Yourself

Why does a cost-based optimizer need selectivity estimates rather than exact counts?
Why are equi-depth histograms usually preferred over equi-width?
How does the independence assumption fail for correlated columns, and what can be done?

Mini Drill or Application

Given a table with 10^7 rows and these statistics:

country: NDV = 200, histogram tells you sel(country='DE') = 0.06
status: NDV = 5, sel(status='ACTIVE') = 0.4
country and status are correlated: real sel(country='DE' AND status='ACTIVE') = 0.01

Compute the optimizer's independence-assumption estimate for the combined predicate.
Compute the true count and the relative error.
Describe one plan difference that could result from this error.
Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.

Read This Only If Stuck

[Database System Concepts: 16.1 Overview, 25..32), assumes uniform within it. That bucket spans 7integer values and holds100,000rows, so the estimate is~100,000 / 7 ~= 14,286` rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).

Concrete Example — Join Cost Walkthrough

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.

Statistics:

users: |users| = 10^6; sel(country='DE') = 0.05; estimated 50,000 users rows after filter.
orders: |orders| = 10^8; NDV(user_id) = 10^6; assumed uniform, ~100 orders per user.

Two plans:

Plan A — hash join, build on filter(users):

filter users -> 50,000 rows, one sequential scan: ~O(|users|)
build hash table: 50,000 entries
probe with orders, sequential scan of 10^8, for each row lookup in hash table
output size estimate: 50,000 users * 100 orders/user = 5 * 10^6 rows
cost ≈ scan(users) + scan(orders) + hash(50K) + probe(10^8) → dominated by the orders scan

Plan B — nested loop, filter users outer, index on orders(user_id):

filter users -> 50,000 rows
for each filtered user, index lookup in orders returning ~100 rows each
50,000 * 100 = 5 * 10^6 index-nested-loop fetches

Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.

Common Confusion / Misconception

How To Use It

When a plan looks wrong:

Run EXPLAIN (ANALYZE, BUFFERS) and compare estimated rows vs actual rows at every node.
Where they disagree by >10x, that is the source of the bad plan.
Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
Only after stats are healthy, consider query rewrites or index changes.
Remember: a "fast enough" plan on bad statistics may silently regress when data grows.

Check Yourself

Why does a cost-based optimizer need selectivity estimates rather than exact counts?
Why are equi-depth histograms usually preferred over equi-width?
How does the independence assumption fail for correlated columns, and what can be done?

Mini Drill or Application

Given a table with 10^7 rows and these statistics:

country: NDV = 200, histogram tells you sel(country='DE') = 0.06
status: NDV = 5, sel(status='ACTIVE') = 0.4
country and status are correlated: real sel(country='DE' AND status='ACTIVE') = 0.01

Compute the optimizer's independence-assumption estimate for the combined predicate.
Compute the true count and the relative error.
Describe one plan difference that could result from this error.
Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.

Read This Only If Stuck

[Database System Concepts: 16.1 Overview`

Supporting curriculum routes

No supporting curriculum routes linked yet.

External enrichment

PostgreSQL Documentation: Indexes (official_docs_companion) - Grounds index and execution-planning ideas in a real production database that learners can actually inspect.
PostgreSQL Documentation: Using EXPLAIN (official_docs_companion) - Useful when the learner needs to connect storage-engine ideas to actual execution plans and cost reasoning.

AI companion modes

Explain simply
Socratic tutor
Challenge my understanding
Diagnose my confusion
Connect forward / backward

Source-of-truth note

This teaching unit is learner-facing guidance. Its canonical source backbone is the referenced book database-system-concepts, and outside material should only clarify or strengthen that backbone.

Curriculum surface​

Learning objectives​

Prerequisites​

Source books​

Source routes​

Database System Concepts​

Concrete Example -- Join Cost Walkthrough​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​

Concrete Example — Join Cost Walkthrough​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​

Supporting curriculum routes​

External enrichment​

AI companion modes​

Source-of-truth note​

Curriculum surface

Learning objectives

Prerequisites

Source books

Source routes

Database System Concepts

Concrete Example -- Join Cost Walkthrough

Common Confusion / Misconception

How To Use It

Check Yourself

Mini Drill or Application

Read This Only If Stuck

Concrete Example — Join Cost Walkthrough

Common Confusion / Misconception

How To Use It

Check Yourself

Mini Drill or Application

Read This Only If Stuck

Supporting curriculum routes

External enrichment

AI companion modes

Source-of-truth note