Skip to main content

Statistics, Histograms, and the Cost-Based Optimizer

This generated surface maps a learner-facing curriculum unit to its canonical source routes.

Curriculum surface

  • Open learner-facing unit
  • Curriculum path: content/curriculum/architecture/semester-06-databases-distributed/module-02-storage-engines-indexing/concepts/cluster-04-query-execution-and-planning/12-statistics-histograms-and-the-cost-based-optimizer-primary.md
  • App: architecture
  • Semester: semester-06-databases-distributed
  • Module: module-02-storage-engines-indexing
  • Unit kind: concept
  • Curation level: module_curated

Learning objectives

  • Explain Statistics, Histograms, and the Cost-Based Optimizer in terms of physical layout, access paths, and performance tradeoffs instead of memorizing structure names.
  • Relate Statistics, Histograms, and the Cost-Based Optimizer to the actual cost of reads, writes, buffering, and maintenance work inside a storage engine.
  • Use database-system-concepts to connect the learner explanation to B-trees, LSM behavior, query execution, and recovery mechanics.

Prerequisites

  • Comfort with the relational model, SQL querying, and basic database terminology from module 01.

Source books

  • database-system-concepts

Source routes

Database System Concepts

  • /books/database-system-concepts via Database System Concepts: 16.2 Transformation of Relational Expressions (Part 1), Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 1), Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 2), Database System Concepts: 16.4 Choice of Evaluation Plans (Part 1)
  • /books/database-system-concepts/chapter-26-blockchain-databases via 25..32), assumes uniform within it. That bucket spans 7 integer values and holds 100,000 rows, so the estimate is ~100,000 / 7 ~= 14,286 rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).

If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.

Concrete Example -- Join Cost Walkthrough

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.

Statistics:

  • users: |users| = 10^6; sel(country='DE') = 0.05; estimated 50,000 users rows after filter.
  • orders: |orders| = 10^8; NDV(user_id) = 10^6; assumed uniform, ~100 orders per user.

Two plans:

Plan A -- hash join, build on filter(users):

  • filter users -> 50,000 rows, one sequential scan: ~O(|users|)
  • build hash table: 50,000 entries
  • probe with orders, sequential scan of 10^8, for each row lookup in hash table
  • output size estimate: 50,000 users * 100 orders/user = 5 * 10^6 rows
  • cost ≈ scan(users) + scan(orders) + hash(50K) + probe(10^8) -> dominated by the orders scan

Plan B -- nested loop, filter users outer, index on orders(user_id):

  • filter users -> 50,000 rows
  • for each filtered user, index lookup in orders returning ~100 rows each
  • 50,000 * 100 = 5 * 10^6 index-nested-loop fetches

Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.

The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches -- catastrophic.

Common Confusion / Misconception

"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.

"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.

"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.

How To Use It

When a plan looks wrong:

  1. Run EXPLAIN (ANALYZE, BUFFERS) and compare estimated rows vs actual rows at every node.
  2. Where they disagree by >10x, that is the source of the bad plan.
  3. Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
  4. Only after stats are healthy, consider query rewrites or index changes.
  5. Remember: a "fast enough" plan on bad statistics may silently regress when data grows.

Check Yourself

  1. Why does a cost-based optimizer need selectivity estimates rather than exact counts?
  2. Why are equi-depth histograms usually preferred over equi-width?
  3. How does the independence assumption fail for correlated columns, and what can be done?

Mini Drill or Application

Given a table with 10^7 rows and these statistics:

  • country: NDV = 200, histogram tells you sel(country='DE') = 0.06
  • status: NDV = 5, sel(status='ACTIVE') = 0.4
  • country and status are correlated: real sel(country='DE' AND status='ACTIVE') = 0.01
  1. Compute the optimizer's independence-assumption estimate for the combined predicate.
  2. Compute the true count and the relative error.
  3. Describe one plan difference that could result from this error.
  4. Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.

Read This Only If Stuck

  • [Database System Concepts: 16.1 Overview, 25..32), assumes uniform within it. That bucket spans 7integer values and holds100,000rows, so the estimate is~100,000 / 7 ~= 14,286` rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).

If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.

Concrete Example — Join Cost Walkthrough

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.

Statistics:

  • users: |users| = 10^6; sel(country='DE') = 0.05; estimated 50,000 users rows after filter.
  • orders: |orders| = 10^8; NDV(user_id) = 10^6; assumed uniform, ~100 orders per user.

Two plans:

Plan A — hash join, build on filter(users):

  • filter users -> 50,000 rows, one sequential scan: ~O(|users|)
  • build hash table: 50,000 entries
  • probe with orders, sequential scan of 10^8, for each row lookup in hash table
  • output size estimate: 50,000 users * 100 orders/user = 5 * 10^6 rows
  • cost ≈ scan(users) + scan(orders) + hash(50K) + probe(10^8) → dominated by the orders scan

Plan B — nested loop, filter users outer, index on orders(user_id):

  • filter users -> 50,000 rows
  • for each filtered user, index lookup in orders returning ~100 rows each
  • 50,000 * 100 = 5 * 10^6 index-nested-loop fetches

Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.

The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches — catastrophic.

Common Confusion / Misconception

"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.

"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.

"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.

How To Use It

When a plan looks wrong:

  1. Run EXPLAIN (ANALYZE, BUFFERS) and compare estimated rows vs actual rows at every node.
  2. Where they disagree by >10x, that is the source of the bad plan.
  3. Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
  4. Only after stats are healthy, consider query rewrites or index changes.
  5. Remember: a "fast enough" plan on bad statistics may silently regress when data grows.

Check Yourself

  1. Why does a cost-based optimizer need selectivity estimates rather than exact counts?
  2. Why are equi-depth histograms usually preferred over equi-width?
  3. How does the independence assumption fail for correlated columns, and what can be done?

Mini Drill or Application

Given a table with 10^7 rows and these statistics:

  • country: NDV = 200, histogram tells you sel(country='DE') = 0.06
  • status: NDV = 5, sel(status='ACTIVE') = 0.4
  • country and status are correlated: real sel(country='DE' AND status='ACTIVE') = 0.01
  1. Compute the optimizer's independence-assumption estimate for the combined predicate.
  2. Compute the true count and the relative error.
  3. Describe one plan difference that could result from this error.
  4. Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.

Read This Only If Stuck

  • [Database System Concepts: 16.1 Overview`

Supporting curriculum routes

No supporting curriculum routes linked yet.

External enrichment

AI companion modes

  • Explain simply
  • Socratic tutor
  • Challenge my understanding
  • Diagnose my confusion
  • Connect forward / backward

Source-of-truth note

This teaching unit is learner-facing guidance. Its canonical source backbone is the referenced book database-system-concepts, and outside material should only clarify or strengthen that backbone.