Hash, Bitmap, GIN/GiST: Specialty Indexes
What This Concept Is
B+-trees and LSM-trees are generalists. When the workload has a specific shape -- strict equality, low-cardinality filters, multi-valued columns, or geometric data -- a specialty index can be an order of magnitude faster. The four most useful ones to know:
- Hash index: a hash table mapping key -> RID. Serves only equality predicates. No ordering, no range scans.
O(1)expected lookup. PostgreSQL and SQL Server support them; MySQL InnoDB has an adaptive internal variant. - Bitmap index: for each distinct value in a low-cardinality column, store a bitmap with one bit per row.
AND/OR/NOTof several predicates becomes bitwise operations on the bitmaps. Enormous for analytics on columns likestatus,country,is_active. - GIN (Generalized Inverted Index): for columns whose values contain multiple searchable elements (arrays,
jsonb, tsvector for full-text). Storeselement -> list_of_row_ids. Dominates for "does this value contain X" predicates. - GiST (Generalized Search Tree): an extensible tree framework supporting multi-dimensional and user-defined types. The workhorse for geospatial (
geometry,geography), range types, and nearest-neighbor queries via the<->operator.
These are not replacements for B+-trees. They are the right shape when the predicate or data type does not match B+-tree's ordered-key access pattern.
Why It Matters Here
Every PostgreSQL user has encountered EXPLAIN output showing Bitmap Index Scan or GIN index. Knowing when each applies is the difference between a 50 ms query and a 10 s one.
The general rule: if a B+-tree produces thousands of random heap fetches for one predicate, check whether a different index structure would avoid that.
Concrete Example -- Bitmap Index
Table orders(status TEXT, country TEXT, ...) with 10^8 rows. status has 4 values and country has 200. Query: WHERE status = 'PENDING' AND country = 'DE'.
Under a B+-tree on status, the index matches 25% of rows -- 25 million RIDs. Following each to the heap is catastrophic.
Under bitmaps:
bitmap(status='PENDING') : 10^8 bits, ~25% set
bitmap(country='DE') : 10^8 bits, ~1% set
AND : 10^8 bits, ~0.25% set -> ~250,000 rows
The AND is a straight scan over two bit arrays, highly vectorizable. Then the engine does a bitmap heap scan: iterate set bits, fetch heap rows in page order so I/O becomes sequential. A query that was millions of random reads becomes a few bitmap operations plus ordered heap reads.
Concrete Example -- GIN for jsonb
CREATE INDEX ON docs USING GIN (data jsonb_path_ops);
SELECT * FROM docs WHERE data @> '{"tags": ["urgent"]}';
A B+-tree on data would be useless here -- a JSON blob's comparison order does not match "contains this tag." The GIN index instead indexes each (path, value) pair inside the JSON and can point directly at the rows that contain "tags": ["urgent"].
Concrete Example -- GiST for Geospatial
CREATE INDEX ON places USING GIST (location);
SELECT id FROM places ORDER BY location <-> POINT(-74.0, 40.7) LIMIT 10;
Nearest-neighbor queries over a B+-tree are not well-defined; distance in 2D is not a total order. GiST stores a hierarchy of bounding boxes and prunes the search to the boxes that could contain the nearest k points.
When Each Fits
| Workload | Best structure | Why |
|---|---|---|
| Equality only, high cardinality, no ranges | Hash index | O(1) lookup, minimal space overhead |
| Low-cardinality filters, analytical queries | Bitmap | Bitwise AND/OR across many predicates |
Full-text, arrays, jsonb containment | GIN | Inverted list of embedded tokens |
| Geospatial, ranges, nearest-neighbor | GiST / SP-GiST / BRIN | Multi-dimensional pruning |
| Ordered range scans + point lookups | B+-tree | The generalist default |
| Write-heavy KV, possibly append | LSM-tree | Sequential writes, batched compaction |
Common Confusion / Misconception
"Bitmap indexes are only for columns with two values." The technique works up to a few thousand distinct values; the bitmap count grows with cardinality, but compression (run-length encoded bitmaps like Roaring) keeps storage sub-linear. They fail when cardinality is comparable to the row count -- the bitmaps become as expensive as the rows themselves.
"Hash indexes are always faster than B+-trees for equality." Only until you need a range, an ordered iteration, or durability guarantees the engine does not offer for hash indexes. PostgreSQL's hash indexes were not crash-safe until version 10 -- one reason B+-tree remained the default.
How To Use It
When a predicate is not best served by a B+-tree:
- Identify the predicate shape: equality, containment, spatial, nearest-neighbor, low-cardinality conjunction.
- Match it to the structure above.
- Check whether your engine supports the index kind you want and whether
EXPLAINactually picks it. - Confirm with a cost comparison: sometimes the optimizer needs hints or statistics refresh to choose correctly.
Check Yourself
- Why does a bitmap index dominate a B+-tree for low-cardinality conjunctions but collapse as cardinality grows?
- Why can a hash index answer equality faster than a B+-tree yet still be a bad default?
- Why are GIN and GiST usually described as "frameworks" rather than single algorithms?
Mini Drill or Application
For each workload, pick hash / bitmap / GIN / GiST / B+-tree / LSM and justify in one sentence:
- Append-heavy IoT time-series store with
10^8events/day, mostly range scans by time. - Analytical filters over
country,segment,plan_tieron a5 * 10^8row table. - Full-text search over a
jsonbcolumn with a few dozen path predicates. - Web-app login by
email, nothing else on the column. - Geofencing: find all points within
500 mof a moving vehicle. - Primary key lookup on a customer table.