Collision Resolution: Chaining vs Open Addressing
What This Concept Is
When two keys hash to the same bucket you must resolve the collision. The two classical schemes:
- Chaining: each bucket holds a linked list (or small array, or tree) of entries. Insert appends; search walks the list.
- Open addressing: every entry lives directly in the bucket array. On collision, follow a probe sequence (h(k, 0), h(k, 1), h(k, 2), \dots) until you find the key or an empty slot.
Common open-addressing probe sequences:
- Linear probing: (h(k, i) = (h(k) + i) \bmod m) -- simplest, cache-friendly, suffers "primary clustering."
- Quadratic probing: (h(k, i) = (h(k) + c_1 i + c_2 i^2) \bmod m) -- avoids primary clustering but needs care on
mto guarantee full coverage. - Double hashing: (h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m) -- two independent hash functions, best theoretical behavior among the three.
- Robin Hood hashing: on collision, "take from the rich and give to the poor" -- displace the existing entry if it has a shorter probe distance than the new key. Caps worst-case probe length.
- Cuckoo hashing: two hash functions and two tables; each key lives in one of two fixed slots. Worst-case (O(1)) lookup, amortized (O(1)) insert.
Under simple uniform hashing with load factor (\alpha = n / m):
- Chaining: expected search cost is (1 + \alpha); degrades gracefully when (\alpha > 1)
- Open addressing (successful search): expected (\frac{1}{\alpha} \ln \frac{1}{1-\alpha}); (unsuccessful search): (\frac{1}{1 - \alpha}); catastrophic as (\alpha \to 1)
Modern SIMD-parallel open addressing (Google SwissTable, Rust hashbrown, Facebook F14) uses 16-byte "control bytes" scanned in parallel with SSE instructions, giving sub-nanosecond lookup on cached tables.
Why It Matters Here
Hash-table performance is entirely about collision strategy. The two schemes trade different things:
- chaining wastes memory on pointers but handles high load gracefully
- open addressing is cache-friendly (entries packed in contiguous memory) but performance collapses near full load
- open addressing cannot handle load factor (> 1); chaining can
- chaining has simple deletion; open addressing requires tombstones
Modern production hash tables pick a mix based on use case: Java HashMap uses chaining with red-black-tree fallback under collision load; Python dict uses open addressing with perturbation-based probing (the "perturb" trick mixes the full hash into the probe sequence); Go's map uses bucketed chaining with 8-way fan-out; Google's SwissTable uses SIMD-parallel linear probing with group-scan.
Knowing which scheme is in your library changes how you configure capacity, when you resize, and how adversarial input damages you. It also changes your p99 latency profile -- chaining with tree fallback has bounded worst case; open addressing has degrading but predictable probe length.
Concrete Examples
Example 1 -- side-by-side: chaining vs linear probing. Insert keys 10, 22, 31, 4, 15, 28, 17 into a table of size (m = 11) using (h(k) = k \bmod 11).
Hashes: 10->10, 22->0, 31->9, 4->4, 15->4, 28->6, 17->6.
Chaining (each bucket a list):
0: [22]
4: [4, 15]
6: [28, 17]
9: [31]
10: [10]
Lookup for 17 visits bucket 6 and walks past 28 -> 2 steps.
Linear probing (open addressing):
10->10, 22->0, 31->9, 4->4, 15->4 occupied, probe 5 -> 15@5
28->6, 17->6 occupied, probe 7 -> 17@7
Final table indices: 0:22, 4:4, 5:15, 6:28, 7:17, 9:31, 10:10. Lookup for 17 probes 6 then 7.
With linear probing, keys around a collision get pushed farther and farther -- this is "primary clustering" and makes later lookups slower than chaining predicts.
Example 2 -- expected probe count vs load factor. At (\alpha = 0.5): chaining averages (1.5) probes per successful search; linear probing averages (\frac{1}{2}(1 + \frac{1}{1-0.5}) = 1.5). At (\alpha = 0.9): chaining averages (1.9); linear probing averages (\frac{1}{2}(1 + \frac{1}{0.1}) = 5.5). At (\alpha = 0.99): chaining averages (1.99); linear probing averages (\frac{1}{2}(1 + 100) = 50.5). The divergence at high load is why production open-addressing tables aggressively resize near (\alpha = 0.75).
Common Confusion / Misconceptions
"Open addressing is always faster because no pointers." Open addressing beats chaining on cache at low load factor. At load factor above (\sim 0.7), clustering in linear probing and probe-sequence length in general open addressing can make it much slower than chaining. The crossover depends on key size and hash quality.
"Deletion is easy in open addressing." Simply erasing a slot breaks probe sequences -- future lookups for a later key in the same cluster see "empty" and stop. Real implementations use a tombstone marker or lazy rehash. Chaining has no such issue; delete is a list removal. Robin Hood hashing supports backward-shift deletion that avoids tombstones but requires extra bookkeeping.
"Quadratic probing visits every slot." Quadratic probing without a careful m choice can fail to visit every slot, leaving insertion stuck even when the table is not full. Classic requirement: (m) prime and (c_1 = c_2 = 1/2), or m a power of 2 with triangular numbers (T_i = i(i+1)/2), which is guaranteed to visit all slots.
"Cuckoo hashing guarantees (O(1)) worst-case everything." Worst-case (O(1)) lookup only. Insertions can cycle (a "cuckoo loop") and trigger a full rehash. Amortized cost is still (O(1)) with high probability as long as (\alpha < 0.5) (two-way) or (\alpha < 0.9) (with stash / higher fan-out).
How To Use It
Pick chaining when:
- load factor might temporarily exceed (1)
- you need simple deletion semantics
- the per-entry overhead of a pointer is acceptable
- you want tree-fallback for adversarial resistance (Java 8 style)
Pick open addressing when:
- memory and cache dominate (tight packing matters, e.g., small integer keys)
- load factor is controlled below (\sim 0.75)
- you can handle tombstones or periodic rehash
- you want SIMD-parallel probes (SwissTable-style)
Pick specialized variants when:
- Robin Hood hashing -- you want bounded worst-case probe length
- Cuckoo hashing -- you need worst-case (O(1)) lookup (real-time systems, networking packet classification)
Always:
- pick (m) to be a prime or a power of two with a quality hash function
- resize when the table crosses its target load factor
- for open addressing, use double hashing or Robin Hood over plain linear probing if clustering is a concern
- for adversarial input, randomize the hash function seed at table creation (keyed SipHash is the canonical choice)
Transfer / Where This Shows Up Later
- S2 M2 (this module). Concept 13 uses load-factor arithmetic to decide resizing; concept 11's expected-case analysis relies on these collision resolution strategies.
- S2 M3 (graphs). Hash sets for BFS/DFS "visited" tracking typically use chaining with tree fallback when graphs are adversarial.
- S3 (software design). "Open addressing vs chaining" is the canonical pair for "struct-of-arrays vs array-of-structs" cache performance discussion in systems-oriented codebases.
- S4 (systems programming). CPU L1/L2 set-associative caches are hardware open-addressing hash tables -- each cache set has a fixed number of ways, and the cache line "probes" across them. TLB and branch-target buffers use the same pattern.
- S5 (OS). Linux's
dcacheuses chaining with RCU. The slab allocator uses bucketed free lists (chaining). Route tables often use open addressing. - S6 (databases). Hash-join inner-table build uses open addressing with linear probing because the table is built once and scanned hot. LSM-tree memtables use skiplists (conceptually chaining over a probabilistic index). Bloom filters use (k) independent hash functions -- a close relative of double hashing.
- S7 / S8 (architecture, system design). Rate-limiter hash tables pick open addressing for tight cache behavior. Consistent hashing rings use binary search over a sorted array -- the "give up chaining for cache-friendly scan" pattern.
Check Yourself
- Why is deletion harder in open addressing than in chaining?
- What is "primary clustering" and what probe scheme causes it?
- At load factor (\alpha = 0.9), which scheme degrades faster -- chaining or open addressing?
- Why does SwissTable use "control bytes" and SIMD-parallel scanning, and what load factor does it target?
- What is a cuckoo loop, and how do implementations detect and recover from it?
Mini Drill or Application
- Insert keys
[24, 11, 37, 50, 7, 18, 30]into a table of size 7 with (h(k) = k \bmod 7) using (a) chaining and (b) linear probing. Compare the final states. - Derive the expected search cost for chaining at load factor (\alpha) using linearity of expectation.
- Demonstrate with a small example why a naive delete in linear probing breaks a later lookup.
- Implement Robin Hood hashing with backward-shift delete. Measure the max probe length distribution on 1M random inserts and compare against plain linear probing.
- Implement a toy cuckoo hash with two tables of size (n); insert until you hit a cycle, then rebuild. Measure the insertion failure rate as (\alpha \to 0.5).
Read This Only If Stuck
- CLRS: 11.2 Hash Tables (chaining details)
- CLRS: 11.2 Hash Tables (chaining analysis)
- CLRS: 11.4 Open Addressing
- CLRS: 11.4 Open Addressing (continued)
- CLRS: 11.5 Practical Considerations
- Sedgewick: Hashing
- Sedgewick: Hashing (collision strategies)
- Grokking Algorithms: Collisions, Performance, Load Factor
- Skiena: 3.7 Hashing and Strings
- Princeton Sedgewick: 3.4 Hash Tables (separate chaining vs linear probing) -- both schemes implemented, analyzed, and benchmarked.
- MIT 6.006 Lecture 9: Table Doubling, Karp-Rabin -- open addressing treated with the I/O-efficient analysis.
- Google SwissTable design notes (CppCon 2017) -- the canonical treatment of SIMD-parallel open addressing.