Search-Data-Structure Choice: Array vs Linked List vs Tree
What This Concept Is
Before picking a search algorithm, pick a storage layout. Five basic shapes dominate:
- Unsorted array --
O(1)append;O(n)search; great cache locality; no ordering structure. - Sorted array --
O(log n)search via binary search;O(n)insert/delete because of shifting; cache-friendly and supports range queries inO(log n + k). - Linked list --
O(1)insert/delete given a pointer;O(n)search; poor locality; supports streaming / no-reallocation; used mostly inside other structures (free lists, hash bucket chains). - Balanced BST (red-black, AVL, treap, skiplist) --
O(log n)search, insert, delete; pointer chasing; supports ordered iteration and predecessor/successor queries; resistant to adversarial keys. - Hash table -- expected
O(1)search, insert, delete; no ordered operations; memory overhead for load factor; no worst-case guarantee without extra machinery (concept 11).
Two distinctions worth naming explicitly:
- AVL vs red-black. AVL trees keep (|h_L - h_R| \leq 1) at every node; they are more rigidly balanced and give faster lookups. Red-black trees relax the balance to (h_{\max} \leq 2 h_{\min}); they do fewer rotations on inserts/deletes and are preferred for write-heavy workloads. Java
TreeMapand C++std::mapuse red-black; in-memory databases sometimes prefer AVL. - Skiplist vs tree. A skiplist is a probabilistic layered linked list with expected (O(\log n)) operations. Simpler to implement lock-free than a BST; used in Redis sorted sets, LevelDB memtables, and some lock-free concurrent maps. Asymptotically identical to a balanced BST; differs in concurrency story.
Hash tables belong to the next cluster; they are mentioned here because they are usually the right answer when you do not need order -- and knowing that is part of choosing a container.
Why It Matters Here
Picking the wrong container makes every algorithm on top of it slow. The failure mode is always the same:
- using a linked list when you want random access
- using a sorted array when you want frequent insert/delete
- using a BST when a hash table would do
- using a hash table when you need order
- using AVL when a red-black tree suffices, or vice versa
Binary search only earns its O(log n) on a structure that supports O(1) random access. On a linked list, "binary search" degrades to O(n) because finding the middle already costs O(n/2).
This concept is the place where "asymptotic" meets "layout." Every later chapter -- graphs, dynamic programming, priority queues -- depends on matching operation mix to container. It is also where most performance bugs in real systems actually live: "we used the wrong map type" beats "our algorithm is wrong" by a large margin.
Concrete Examples
Example 1 -- phonebook. Implement a phonebook that supports lookup by name, insertion, and sorted traversal.
- Sorted array: lookup
O(log n)via binary search, insertO(n)from shifting, traversalO(n)in order. Wins on read-heavy workloads. - Linked list (sorted): lookup
O(n), insertO(n)to find spot butO(1)to splice, traversalO(n)in order. Almost never the right choice. - Balanced BST: lookup
O(log n), insertO(log n), traversalO(n)in order. Wins on mixed workloads. - Hash table: lookup
O(1)expected, insertO(1)expected, traversal in key order is not supported inO(n). Wins when order is not required.
Picking the container is the architectural decision. The algorithm is a consequence.
Example 2 -- leaderboard with rank queries. Game leaderboard with 10M players needing: insert score O(?), top-10 O(?), query "player X's rank" O(?).
- Sorted array + binary search: top-10 is
O(log n + 10); rank isO(log n); insert isO(n)-- unusable on inserts. - Hash table: insert
O(1); rank needs sort every query,O(n log n)-- unusable on rank. - Balanced BST augmented with subtree size (an "order statistic tree"): insert
O(log n), top-10O(log n + 10), rankO(log n). This is the right answer. - Redis ZSET (skiplist + hash): insert
O(log n), rankO(log n), top-KO(log n + k). The production incarnation of the order-statistic tree.
The exercise of naming the operations first, then enumerating the containers, is how you avoid the "we grabbed HashMap because it was convenient" bug class.
Common Confusion / Misconceptions
"A linked list is faster than an array for inserts." Only if you already have the pointer to the insert position. If you have to search for the position, the linked list pays (O(n)) twice. On cache-coherent modern CPUs, an array + memmove routinely beats a linked list insert even at (n = 10^4) because of prefetch and branch prediction.
"Hash tables are always better than trees." They do not support range queries (find all keys in [a, b]), nearest-neighbor in ordered data, or ordered traversal without sorting the keys afterward. They also degrade under adversarial input (concept 11) unless the implementation randomizes, which not all do.
"Big-O is the whole story." In practice, an array with linear scan is faster than a hash table for (n \leq \sim 50) because cache and branch prediction dominate. Microbenchmarks on small sets consistently show std::vector<int> linear scan beating std::unordered_map<int> up to a few dozen entries.
"Array means contiguous, fixed-size." "Array" in a high-level language can mean a dynamic array (amortized O(1) append but occasional O(n) resize), a fixed-size array (no append), or a linked-list-like chunked buffer (Python deque, C++ std::deque). Asymptotics depend on which one.
How To Use It
Before choosing a container, list the operations you need and their frequencies:
| Operation | Unsorted array | Sorted array | Linked list | Balanced BST | Hash table |
|---|---|---|---|---|---|
| Search key | O(n) | O(log n) | O(n) | O(log n) | O(1) expected |
| Insert | O(1) amort | O(n) | O(1)* | O(log n) | O(1) expected |
| Delete | O(n) | O(n) | O(1)* | O(log n) | O(1) expected |
| In-order iterate | O(n log n) needs sort | O(n) | O(n) | O(n) | O(n log n) needs sort |
| Range query | O(log n + k) | O(log n + k) | O(n) | O(log n + k) | O(n) |
| Min / max | O(n) | O(1) | O(n) / O(1) tail | O(log n) (or O(1) augmented) | O(n) |
k-th order stat | O(n) quickselect | O(1) | O(n) | O(log n) augmented | O(n) |
* given a pointer to the insertion/deletion position.
Then:
- Enumerate the operations you actually need (read your call sites, not the data sheet).
- Weight by frequency and by latency sensitivity.
- Rank containers by the cost of your hot-path operations.
- Rule out any container where a critical operation is unsupported.
- Choose the structure whose expensive column you never use often and whose cheap column matches your hot path.
- Benchmark -- for small (n), big-O is often misleading.
- Document the choice so the next person does not "optimize" by switching to
HashMap.
Transfer / Where This Shows Up Later
- S2 M2 (this module). Binary search (concept 08) assumes random-access sorted array; heaps (concept 14) exploit implicit-tree-in-array layout; the same tradeoffs reappear in the heap priority-queue choice.
- S2 M5 (advanced structures). Augmented BSTs (order-statistic trees, interval trees, range trees) are all specializations of this general taxonomy, picking what to store in each node.
- S3 (software design). "Information hiding through the collection interface" -- Java's
List/Set/Maphierarchy, C++ STL containers -- is exactly this cluster's tradeoffs codified as interface contracts. - S4 (systems programming). Kernel data structures (rbtree for VMAs, hashtable for dcache, linked lists for free lists) are chosen per subsystem by the exact tradeoff analysis above.
- S5 (OS). Linux uses red-black trees (AVL-like) for CFS scheduler, VMA lookup, and epoll ready lists; hash tables for
dcache/inodecache; LRU double-linked lists for page reclamation. Each choice is a concrete instance of this concept. - S6 (databases). B+tree for primary indexes (sorted-array + tree hybrid), hash index for equality lookup, skip-list in memtables (RocksDB/LevelDB), heap files for unsorted append-only storage. Query planners pick between "hash join" and "sort-merge join" by cost.
- S7 (architecture). Registry/catalog patterns: when the data is read-heavy with infrequent writes, use immutable sorted arrays + binary search; when write-heavy, BST. This decision lives in the ADR.
- S8 (system design). Distributed KV stores (DynamoDB, Cassandra) split into "hash partitioning for equality" vs "range partitioning for range queries"; the choice is the distributed version of hash-table-vs-BST.
Check Yourself
- Why does binary search require random access and therefore fail on a linked list?
- For a dictionary that is read 1000x more often than written, which structure wins: BST or hash table? Why?
- What does a hash table lose compared to a balanced BST?
- Why does the Linux kernel use red-black trees for VMA lookup but hash tables for
dcache? - When would a skiplist beat a balanced BST in a real system?
Mini Drill or Application
- Design the storage for a leaderboard that supports "insert score," "top 10 scores," and "player's rank." Which structure(s) do you pick and why?
- Compare sorted array vs balanced BST for a dictionary where inserts and lookups are 50/50. Argue with asymptotics and cache behavior.
- Give one workload where an unsorted array beats all other structures in practice.
- Implement an order-statistic tree (BST augmented with subtree size) supporting
rank(x)andselect(k)in (O(\log n)). - Benchmark
std::mapvsstd::unordered_mapvsabsl::flat_hash_mapon lookup-heavy workload with (n = 10^4) and (n = 10^7). Explain the crossovers. - Production scenario. An ad-bidding service keeps a hot in-memory index of 100M advertiser × keyword pairs, must answer equality lookups in under 1 µs p99, and must also enumerate the top 50 bids per keyword for auction close. Enumerate the operations, rank candidate containers (flat hash, BST, skip list, augmented sorted array), pick one, and write the one-paragraph ADR that the next engineer will read.
Read This Only If Stuck
- Skiena: 3.1 Contiguous vs Linked Data Structures
- Skiena: 3.1.2 Pointers and Linked Structures
- Skiena: Unsorted vs Sorted
- Skiena: 3.4 Binary Search Trees
- Skiena: 3.4.3 Balanced Search Trees
- CLRS: 10.1 Simple Array-Based Data Structures
- CLRS: 10.2 Linked Lists
- CLRS: 12.1 What Is a Binary Search Tree
- CLRS: 13 Red-Black Trees (properties)
- Sedgewick: Elementary Searching Methods
- Sedgewick: Balanced Trees
- Princeton Sedgewick: 3.2 Binary Search Trees -- sorted-array vs BST tradeoffs with measured benchmarks.
- MIT 6.006 Lecture 7: Binary Search Trees -- BST vs sorted array reasoning.
- VisuAlgo: Binary Search Tree, Hash Table, Linked List -- side-by-side animations of the five container shapes in this concept.
- Chandler Carruth: Efficiency with Algorithms, Performance with Data Structures (CppCon 2014) -- the canonical "array beats linked list" benchmark talk, with cache-level explanation.
- Google SwissTable (abseil) -- the open-addressing design that replaced
std::unordered_mapin many production C++ codebases.