Statistics, Histograms, and the Cost-Based Optimizer
What This Concept Is
A relational query has many equivalent execution plans. The cost-based optimizer (CBO) enumerates candidate plans and picks the one with the lowest estimated cost. Its inputs are:
- the query tree (from the parser)
- table and column statistics: row counts, distinct counts (
NDV), null fractions, average widths, min/max, histograms - cost constants: per-I/O, per-CPU-op, per-comparison
- the physical options available (which indexes exist, which join algorithms the engine implements)
Its output is a plan annotated with estimated cardinalities and estimated costs at every node.
Why statistics matter. Cost is almost entirely a function of estimated row counts. If the optimizer thinks a filter returns 100 rows when it really returns 1,000,000, it may choose nested loop instead of hash join, and your 200 ms query becomes 200 s.
Histograms are how the engine estimates selectivity for non-uniform distributions. An equi-width histogram splits the value range into equal-width buckets; an equi-depth histogram splits so each bucket holds about the same number of rows. Equi-depth is dominant because it handles skew better. For a predicate x = v, the engine locates the bucket containing v and divides by its bucket cardinality to estimate the match count.
For combinations of predicates, the default assumption is independence: sel(A AND B) = sel(A) * sel(B). This is often wrong (correlated columns), and most real optimizers offer multi-column statistics as a fix.
Why It Matters Here
Every earlier cluster assumed "the optimizer picks the right plan." The CBO is how it does -- or fails to. The four practical symptoms of a statistics problem are:
- Plans that look right in dev but collapse in production on larger data
- A query flipping between two plans based on a parameter value
- Nested-loop joins with extreme estimated-vs-actual row mismatches
- Queries that get dramatically faster after
ANALYZE/UPDATE STATISTICS
Learning to read an EXPLAIN plan and spot an estimated rows = 1, actual rows = 1M mismatch is the single most useful debugging skill in this module.
Concrete Example -- Selectivity with a Histogram
Column age has 10^6 rows, values 0..120, heavy skew around 25-45. Build an equi-depth histogram with 10 buckets, each ~10^5 rows. Suppose the buckets are:
[0..18) : 100,000 rows
[18..25) : 100,000 rows
[25..32) : 100,000 rows <-- narrower bucket, dense data
[32..38) : 100,000 rows
[38..45) : 100,000 rows
[45..55) : 100,000 rows
[55..65) : 100,000 rows
[65..75) : 100,000 rows
[75..90) : 100,000 rows
[90..120] : 100,000 rows
For age = 30, the estimator finds bucket [25..32), assumes uniform within it. That bucket spans 7 integer values and holds 100,000 rows, so the estimate is ~100,000 / 7 ~= 14,286 rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).
If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.
Concrete Example -- Join Cost Walkthrough
SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.
Statistics:
users:|users| = 10^6;sel(country='DE') = 0.05; estimated50,000usersrows after filter.orders:|orders| = 10^8;NDV(user_id) = 10^6; assumed uniform,~100orders per user.
Two plans:
Plan A -- hash join, build on filter(users):
- filter
users->50,000rows, one sequential scan: ~O(|users|) - build hash table:
50,000entries - probe with
orders, sequential scan of10^8, for each row lookup in hash table - output size estimate:
50,000 users * 100 orders/user = 5 * 10^6rows - cost ≈
scan(users) + scan(orders) + hash(50K) + probe(10^8)-> dominated by theordersscan
Plan B -- nested loop, filter users outer, index on orders(user_id):
- filter
users->50,000rows - for each filtered user, index lookup in
ordersreturning~100rows each 50,000 * 100 = 5 * 10^6index-nested-loop fetches
Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.
The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches -- catastrophic.
Common Confusion / Misconception
"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.
"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.
"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.
How To Use It
When a plan looks wrong:
- Run
EXPLAIN (ANALYZE, BUFFERS)and compareestimated rowsvsactual rowsat every node. - Where they disagree by
>10x, that is the source of the bad plan. - Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
- Only after stats are healthy, consider query rewrites or index changes.
- Remember: a "fast enough" plan on bad statistics may silently regress when data grows.
Check Yourself
- Why does a cost-based optimizer need selectivity estimates rather than exact counts?
- Why are equi-depth histograms usually preferred over equi-width?
- How does the independence assumption fail for correlated columns, and what can be done?
Mini Drill or Application
Given a table with 10^7 rows and these statistics:
country:NDV = 200, histogram tells yousel(country='DE') = 0.06status:NDV = 5,sel(status='ACTIVE') = 0.4countryandstatusare correlated: realsel(country='DE' AND status='ACTIVE') = 0.01
- Compute the optimizer's independence-assumption estimate for the combined predicate.
- Compute the true count and the relative error.
- Describe one plan difference that could result from this error.
- Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.
Read This Only If Stuck
- Database System Concepts: 16.1 Overview
- Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 1)
- Database System Concepts: 16.3 Estimating Statistics of Expression Results (Part 2)
- Database System Concepts: 16.4 Choice of Evaluation Plans (Part 1)
- Database System Concepts: 16.2 Transformation of Relational Expressions (Part 1)
- CMU 15-721 Advanced Database Systems: Cost Models