Skip to main content

Point Lookups, Range Scans, and Why B+-Trees Dominate OLTP

What This Concept Is

Once you have a B+-tree with order m (roughly m children per internal node) and N leaf entries, three costs fall out:

  • point lookup: descend from root to leaf. Cost: O(log_m N) page reads.
  • range scan of k rows in order: descend once to the start of the range, then walk leaves through the linked list. Cost: O(log_m N + k / entries_per_leaf).
  • insert / delete: descend, modify the leaf, then at worst propagate splits or merges up to the root. Cost: O(log_m N) page writes in the worst case.

The key number is log_m N. With fan-out m ~= 250 (typical for 8 KB pages with short keys), log_250 (10^9) ~= 4. That is why a mature B+-tree on a billion rows is 3 to 5 levels deep.

And the linked leaves are why range scans are cheap: after the initial descent, each subsequent leaf is one sequential link follow, not another tree descent.

Why It Matters Here

OLTP workloads are dominated by:

  • point lookups by primary key (WHERE user_id = ?)
  • small range scans (WHERE ts BETWEEN ? AND ? LIMIT 100)
  • single-row inserts/updates/deletes

A B+-tree does all four well: low constant-factor latency for lookups, cheap linked-leaf ranges, in-place updates that preserve ordering, and a rebalancing story that prevents worst-case drift. Alternatives exist (hash indexes, LSM-trees, skip lists), but none match this mix of capabilities at OLTP scale. That is why B+-trees remain the default primary-index structure in almost every relational engine.

Concrete Example -- Height Math

Assume 8 KB pages, 16-byte keys, 8-byte pointers, and some overhead, giving approximately 300 child pointers per internal node. Leaves hold about 200 entries.

  • N = 10^6: leaves = 10^6 / 200 = 5000. Internal levels: log_300 5000 ~= 1.5, so ~2 internal levels + leaf = height 3.
  • N = 10^9: leaves = 5 * 10^6. Internal levels: log_300 (5*10^6) ~= 2.7, so ~3 + leaf = height 4.

Assume the root and next level are cached (they almost always are). A point lookup then costs 1 cold random I/O (the leaf) plus 1 heap fetch if the index is secondary. That is the whole reason OLTP latency can sit in the low milliseconds on a billion-row table.

Concrete Example -- Range Scan

SELECT * FROM events WHERE ts BETWEEN '2026-01-01' AND '2026-01-02' ORDER BY ts on an index clustered by ts.

  1. Descend from root to the leaf containing the smallest qualifying ts. ~3 page reads (mostly cached).
  2. Walk forward along the leaf linked list, reading one leaf page at a time, emitting rows in order.
  3. Stop at the first entry > '2026-01-02'.

If the range matches 100,000 rows at 200 entries/leaf, that is 500 sequential leaf reads plus some heap fetches. Sequentiality is the win: the same 100,000 rows retrieved by a hash or unordered index would produce 500 random reads for the leaves plus the heap fetches, possibly 100x slower.

Common Confusion / Misconception

"The tree is tall because there are a billion rows." The fan-out is what makes the tree short. The logarithm is base-m, not base-2. A 32-level binary tree and a 4-level B+-tree both address a billion items; the B+-tree gets to 4 levels because each node packs hundreds of children.

"Insertions are always cheap." In the average case yes; in the worst case a leaf split can propagate to the root and the tree can grow by one level. Under heavy random-key insert traffic, split cascades can dominate the write path -- this is exactly the problem LSM-trees (next cluster) were designed to attack.

How To Use It

When reading a query plan that uses a B+-tree index:

  1. Estimate height from table size and fan-out.
  2. Assume the upper 1-2 levels are cached.
  3. Count 1 + 1 I/Os for a secondary-index point lookup (leaf + heap).
  4. Count H + k/entries_per_leaf page reads for an ordered range of k rows.
  5. Compare against the full-scan cost (table size in pages); pick whichever is lower.

This is exactly the arithmetic the optimizer does; doing it yourself calibrates your intuition.

Check Yourself

  1. Why does fan-out matter more than total size for B+-tree height?
  2. Why are range scans cheap relative to k random lookups of the same keys?
  3. Under what workload would a hash index beat a B+-tree for point lookups, and why do most engines still default to B+-trees?

Mini Drill or Application

For each scenario, estimate page reads. Assume height 4, fan-out 300, leaves 200 entries, top 2 levels cached.

  1. Point lookup by PK that returns 1 row (clustered index).
  2. Point lookup by secondary index that returns 1 row.
  3. Range scan returning 10,000 ordered rows (clustered).
  4. Range scan returning 10,000 ordered rows (secondary index + heap fetches, random heap order).
  5. Insert 1 row at a leaf that is 50% full.

Read This Only If Stuck