Statistics, Histograms, and the Cost-Based Optimizer

What This Concept Is

A relational query has many equivalent execution plans. The cost-based optimizer (CBO) enumerates candidate plans and picks the one with the lowest estimated cost. Its inputs are:

the query tree (from the parser)
table and column statistics: row counts, distinct counts (NDV), null fractions, average widths, min/max, histograms
cost constants: per-I/O, per-CPU-op, per-comparison
the physical options available (which indexes exist, which join algorithms the engine implements)

Its output is a plan annotated with estimated cardinalities and estimated costs at every node.

Why statistics matter. Cost is almost entirely a function of estimated row counts. If the optimizer thinks a filter returns 100 rows when it really returns 1,000,000, it may choose nested loop instead of hash join, and your 200 ms query becomes 200 s.

Histograms are how the engine estimates selectivity for non-uniform distributions. An equi-width histogram splits the value range into equal-width buckets; an equi-depth histogram splits so each bucket holds about the same number of rows. Equi-depth is dominant because it handles skew better. For a predicate x = v, the engine locates the bucket containing v and divides by its bucket cardinality to estimate the match count.

For combinations of predicates, the default assumption is independence: sel(A AND B) = sel(A) * sel(B). This is often wrong (correlated columns), and most real optimizers offer multi-column statistics as a fix.

Why It Matters Here

Every earlier cluster assumed "the optimizer picks the right plan." The CBO is how it does -- or fails to. The four practical symptoms of a statistics problem are:

Plans that look right in dev but collapse in production on larger data
A query flipping between two plans based on a parameter value
Nested-loop joins with extreme estimated-vs-actual row mismatches
Queries that get dramatically faster after ANALYZE / UPDATE STATISTICS

Learning to read an EXPLAIN plan and spot an estimated rows = 1, actual rows = 1M mismatch is the single most useful debugging skill in this module.

Concrete Example -- Selectivity with a Histogram

Column age has 10^6 rows, values 0..120, heavy skew around 25-45. Build an equi-depth histogram with 10 buckets, each ~10^5 rows. Suppose the buckets are:

[0..18)   : 100,000 rows
[18..25)  : 100,000 rows
[25..32)  : 100,000 rows      <-- narrower bucket, dense data
[32..38)  : 100,000 rows
[38..45)  : 100,000 rows
[45..55)  : 100,000 rows
[55..65)  : 100,000 rows
[65..75)  : 100,000 rows
[75..90)  : 100,000 rows
[90..120] : 100,000 rows

For age = 30, the estimator finds bucket [25..32), assumes uniform within it. That bucket spans 7 integer values and holds 100,000 rows, so the estimate is ~100,000 / 7 ~= 14,286 rows. The optimizer uses this to pick between index-scan (cheap for small outputs) and sequential-scan (cheap for large).

If instead the histogram were missing or absent, the engine would fall back to a crude default like selectivity = 1 / NDV = 1 / 121 ~= 8264 rows. Close to the same answer on average, but useless for skewed queries like age = 25 where the real count is much higher.

Concrete Example -- Join Cost Walkthrough

SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.country = 'DE'.

Statistics:

users: |users| = 10^6; sel(country='DE') = 0.05; estimated 50,000 users rows after filter.
orders: |orders| = 10^8; NDV(user_id) = 10^6; assumed uniform, ~100 orders per user.

Two plans:

Plan A -- hash join, build on filter(users):

filter users -> 50,000 rows, one sequential scan: ~O(|users|)
build hash table: 50,000 entries
probe with orders, sequential scan of 10^8, for each row lookup in hash table
output size estimate: 50,000 users * 100 orders/user = 5 * 10^6 rows
cost ≈ scan(users) + scan(orders) + hash(50K) + probe(10^8) -> dominated by the orders scan

Plan B -- nested loop, filter users outer, index on orders(user_id):

filter users -> 50,000 rows
for each filtered user, index lookup in orders returning ~100 rows each
50,000 * 100 = 5 * 10^6 index-nested-loop fetches

Plan B does fewer total I/Os if the orders index is well-clustered and the user count is really 50K. Plan A wins when the user count is much larger or when the index is not helpful.

The CBO picks whichever has the lower estimated cost. If statistics say users.country='DE' matches 1000 rows (bad stats) but actually returns 500,000, Plan B would run 500,000 * 100 = 5 * 10^7 random fetches -- catastrophic.

Common Confusion / Misconception

"Autovacuum keeps stats fresh." Only if configured with reasonable thresholds; heavily updated tables can run with stale statistics between runs. Always check last_analyze when debugging plan regressions.

"More buckets = better estimates." Up to a point. Beyond ~100 buckets, overhead starts to matter and marginal accuracy gain is small. Most engines default to 100 or 200 buckets per column for this reason.

"Independence assumption is fine." Correlated columns (city ↔ state, product ↔ category) routinely produce estimates off by 100x. If you care, create multi-column statistics or use the engine's equivalent feature.

How To Use It

When a plan looks wrong:

Run EXPLAIN (ANALYZE, BUFFERS) and compare estimated rows vs actual rows at every node.
Where they disagree by >10x, that is the source of the bad plan.
Refresh statistics, add missing histograms, or create multi-column stats on correlated filters.
Only after stats are healthy, consider query rewrites or index changes.
Remember: a "fast enough" plan on bad statistics may silently regress when data grows.

Check Yourself

Why does a cost-based optimizer need selectivity estimates rather than exact counts?
Why are equi-depth histograms usually preferred over equi-width?
How does the independence assumption fail for correlated columns, and what can be done?

Mini Drill or Application

Given a table with 10^7 rows and these statistics:

country: NDV = 200, histogram tells you sel(country='DE') = 0.06
status: NDV = 5, sel(status='ACTIVE') = 0.4
country and status are correlated: real sel(country='DE' AND status='ACTIVE') = 0.01

Compute the optimizer's independence-assumption estimate for the combined predicate.
Compute the true count and the relative error.
Describe one plan difference that could result from this error.
Propose a fix using features of PostgreSQL, SQL Server, or Oracle you know.

What This Concept Is​

Why It Matters Here​

Concrete Example -- Selectivity with a Histogram​

Concrete Example -- Join Cost Walkthrough​

Common Confusion / Misconception​

How To Use It​

Check Yourself​

Mini Drill or Application​

Read This Only If Stuck​