Skip to main content

Search-Data-Structure Choice: Array vs Linked List vs Tree

What This Concept Is

Before picking a search algorithm, pick a storage layout. Five basic shapes dominate:

  • Unsorted array -- O(1) append; O(n) search; great cache locality; no ordering structure.
  • Sorted array -- O(log n) search via binary search; O(n) insert/delete because of shifting; cache-friendly and supports range queries in O(log n + k).
  • Linked list -- O(1) insert/delete given a pointer; O(n) search; poor locality; supports streaming / no-reallocation; used mostly inside other structures (free lists, hash bucket chains).
  • Balanced BST (red-black, AVL, treap, skiplist) -- O(log n) search, insert, delete; pointer chasing; supports ordered iteration and predecessor/successor queries; resistant to adversarial keys.
  • Hash table -- expected O(1) search, insert, delete; no ordered operations; memory overhead for load factor; no worst-case guarantee without extra machinery (concept 11).

Two distinctions worth naming explicitly:

  • AVL vs red-black. AVL trees keep (|h_L - h_R| \leq 1) at every node; they are more rigidly balanced and give faster lookups. Red-black trees relax the balance to (h_{\max} \leq 2 h_{\min}); they do fewer rotations on inserts/deletes and are preferred for write-heavy workloads. Java TreeMap and C++ std::map use red-black; in-memory databases sometimes prefer AVL.
  • Skiplist vs tree. A skiplist is a probabilistic layered linked list with expected (O(\log n)) operations. Simpler to implement lock-free than a BST; used in Redis sorted sets, LevelDB memtables, and some lock-free concurrent maps. Asymptotically identical to a balanced BST; differs in concurrency story.

Hash tables belong to the next cluster; they are mentioned here because they are usually the right answer when you do not need order -- and knowing that is part of choosing a container.

Why It Matters Here

Picking the wrong container makes every algorithm on top of it slow. The failure mode is always the same:

  • using a linked list when you want random access
  • using a sorted array when you want frequent insert/delete
  • using a BST when a hash table would do
  • using a hash table when you need order
  • using AVL when a red-black tree suffices, or vice versa

Binary search only earns its O(log n) on a structure that supports O(1) random access. On a linked list, "binary search" degrades to O(n) because finding the middle already costs O(n/2).

This concept is the place where "asymptotic" meets "layout." Every later chapter -- graphs, dynamic programming, priority queues -- depends on matching operation mix to container. It is also where most performance bugs in real systems actually live: "we used the wrong map type" beats "our algorithm is wrong" by a large margin.

Concrete Examples

Example 1 -- phonebook. Implement a phonebook that supports lookup by name, insertion, and sorted traversal.

  • Sorted array: lookup O(log n) via binary search, insert O(n) from shifting, traversal O(n) in order. Wins on read-heavy workloads.
  • Linked list (sorted): lookup O(n), insert O(n) to find spot but O(1) to splice, traversal O(n) in order. Almost never the right choice.
  • Balanced BST: lookup O(log n), insert O(log n), traversal O(n) in order. Wins on mixed workloads.
  • Hash table: lookup O(1) expected, insert O(1) expected, traversal in key order is not supported in O(n). Wins when order is not required.

Picking the container is the architectural decision. The algorithm is a consequence.

Example 2 -- leaderboard with rank queries. Game leaderboard with 10M players needing: insert score O(?), top-10 O(?), query "player X's rank" O(?).

  • Sorted array + binary search: top-10 is O(log n + 10); rank is O(log n); insert is O(n) -- unusable on inserts.
  • Hash table: insert O(1); rank needs sort every query, O(n log n) -- unusable on rank.
  • Balanced BST augmented with subtree size (an "order statistic tree"): insert O(log n), top-10 O(log n + 10), rank O(log n). This is the right answer.
  • Redis ZSET (skiplist + hash): insert O(log n), rank O(log n), top-K O(log n + k). The production incarnation of the order-statistic tree.

The exercise of naming the operations first, then enumerating the containers, is how you avoid the "we grabbed HashMap because it was convenient" bug class.

Common Confusion / Misconceptions

"A linked list is faster than an array for inserts." Only if you already have the pointer to the insert position. If you have to search for the position, the linked list pays (O(n)) twice. On cache-coherent modern CPUs, an array + memmove routinely beats a linked list insert even at (n = 10^4) because of prefetch and branch prediction.

"Hash tables are always better than trees." They do not support range queries (find all keys in [a, b]), nearest-neighbor in ordered data, or ordered traversal without sorting the keys afterward. They also degrade under adversarial input (concept 11) unless the implementation randomizes, which not all do.

"Big-O is the whole story." In practice, an array with linear scan is faster than a hash table for (n \leq \sim 50) because cache and branch prediction dominate. Microbenchmarks on small sets consistently show std::vector<int> linear scan beating std::unordered_map<int> up to a few dozen entries.

"Array means contiguous, fixed-size." "Array" in a high-level language can mean a dynamic array (amortized O(1) append but occasional O(n) resize), a fixed-size array (no append), or a linked-list-like chunked buffer (Python deque, C++ std::deque). Asymptotics depend on which one.

How To Use It

Before choosing a container, list the operations you need and their frequencies:

OperationUnsorted arraySorted arrayLinked listBalanced BSTHash table
Search keyO(n)O(log n)O(n)O(log n)O(1) expected
InsertO(1) amortO(n)O(1)*O(log n)O(1) expected
DeleteO(n)O(n)O(1)*O(log n)O(1) expected
In-order iterateO(n log n) needs sortO(n)O(n)O(n)O(n log n) needs sort
Range queryO(log n + k)O(log n + k)O(n)O(log n + k)O(n)
Min / maxO(n)O(1)O(n) / O(1) tailO(log n) (or O(1) augmented)O(n)
k-th order statO(n) quickselectO(1)O(n)O(log n) augmentedO(n)

* given a pointer to the insertion/deletion position.

Then:

  1. Enumerate the operations you actually need (read your call sites, not the data sheet).
  2. Weight by frequency and by latency sensitivity.
  3. Rank containers by the cost of your hot-path operations.
  4. Rule out any container where a critical operation is unsupported.
  5. Choose the structure whose expensive column you never use often and whose cheap column matches your hot path.
  6. Benchmark -- for small (n), big-O is often misleading.
  7. Document the choice so the next person does not "optimize" by switching to HashMap.

Transfer / Where This Shows Up Later

  • S2 M2 (this module). Binary search (concept 08) assumes random-access sorted array; heaps (concept 14) exploit implicit-tree-in-array layout; the same tradeoffs reappear in the heap priority-queue choice.
  • S2 M5 (advanced structures). Augmented BSTs (order-statistic trees, interval trees, range trees) are all specializations of this general taxonomy, picking what to store in each node.
  • S3 (software design). "Information hiding through the collection interface" -- Java's List / Set / Map hierarchy, C++ STL containers -- is exactly this cluster's tradeoffs codified as interface contracts.
  • S4 (systems programming). Kernel data structures (rbtree for VMAs, hashtable for dcache, linked lists for free lists) are chosen per subsystem by the exact tradeoff analysis above.
  • S5 (OS). Linux uses red-black trees (AVL-like) for CFS scheduler, VMA lookup, and epoll ready lists; hash tables for dcache / inode cache; LRU double-linked lists for page reclamation. Each choice is a concrete instance of this concept.
  • S6 (databases). B+tree for primary indexes (sorted-array + tree hybrid), hash index for equality lookup, skip-list in memtables (RocksDB/LevelDB), heap files for unsorted append-only storage. Query planners pick between "hash join" and "sort-merge join" by cost.
  • S7 (architecture). Registry/catalog patterns: when the data is read-heavy with infrequent writes, use immutable sorted arrays + binary search; when write-heavy, BST. This decision lives in the ADR.
  • S8 (system design). Distributed KV stores (DynamoDB, Cassandra) split into "hash partitioning for equality" vs "range partitioning for range queries"; the choice is the distributed version of hash-table-vs-BST.

Check Yourself

  1. Why does binary search require random access and therefore fail on a linked list?
  2. For a dictionary that is read 1000x more often than written, which structure wins: BST or hash table? Why?
  3. What does a hash table lose compared to a balanced BST?
  4. Why does the Linux kernel use red-black trees for VMA lookup but hash tables for dcache?
  5. When would a skiplist beat a balanced BST in a real system?

Mini Drill or Application

  1. Design the storage for a leaderboard that supports "insert score," "top 10 scores," and "player's rank." Which structure(s) do you pick and why?
  2. Compare sorted array vs balanced BST for a dictionary where inserts and lookups are 50/50. Argue with asymptotics and cache behavior.
  3. Give one workload where an unsorted array beats all other structures in practice.
  4. Implement an order-statistic tree (BST augmented with subtree size) supporting rank(x) and select(k) in (O(\log n)).
  5. Benchmark std::map vs std::unordered_map vs absl::flat_hash_map on lookup-heavy workload with (n = 10^4) and (n = 10^7). Explain the crossovers.
  6. Production scenario. An ad-bidding service keeps a hot in-memory index of 100M advertiser × keyword pairs, must answer equality lookups in under 1 µs p99, and must also enumerate the top 50 bids per keyword for auction close. Enumerate the operations, rank candidate containers (flat hash, BST, skip list, augmented sorted array), pick one, and write the one-paragraph ADR that the next engineer will read.

Read This Only If Stuck