Skip to main content

The Union-Find Problem and the Naive Representation

What This Concept Is

The union-find problem (also called disjoint-set union) maintains a partition of n elements into equivalence classes and supports three operations:

  • make-set(x): create a new singleton class containing x
  • find(x): return a canonical representative of x's class
  • union(x, y): merge the classes containing x and y into one

Two elements are in the same class if and only if find(x) == find(y). The representative can be any element chosen consistently; what matters is that calling find twice on the same element (without intervening unions) returns the same value.

The simplest implementation stores each class as a linked list whose head is the representative. Every element stores a pointer to its list head. find is O(1) -- one pointer dereference -- but union has to rewire every pointer in the smaller list, which is O(n) in the worst case. A sequence of n - 1 unions can cost Theta(n^2).

A slightly less naive representation stores each class as an up-tree: each element has a parent pointer; the root points to itself and is the representative. find walks parent pointers to the root (O(h)), union links two roots (O(1)). Unfortunately, without heuristics the trees degenerate into paths and find becomes O(n), trading the pain from union into find instead.

The key conceptual move -- understood before any heuristic -- is that union-find is an online structure for monotone equivalence relations. You can add pairs to the relation via union; you cannot remove them. That monotonicity is precisely what makes the heuristics work.

Why It Matters Here

Union-find is the right data structure whenever a problem can be framed in terms of dynamic equivalence classes:

  • connected components of a graph under edge insertion
  • cycle detection inside Kruskal's MST algorithm
  • variable aliasing and unification in compilers and type checkers
  • image segmentation (pixels belong to the same region)
  • social-graph friend-of-friend grouping in bulk

Before you can appreciate why union by rank plus path compression is "effectively constant," you have to see how bad the naive version is. This concept is the baseline; concept 06 is the acceleration.

Concrete Examples

Example 1 -- linked-list cost blow-up. Start with seven singletons {1}, {2}, ..., {7}. Perform the sequence:

union(1, 2)   -> {1, 2}
union(3, 4) -> {3, 4}
union(1, 3) -> {1, 2, 3, 4}
union(5, 6) -> {5, 6}
union(1, 5) -> {1, 2, 3, 4, 5, 6}
union(1, 7) -> {1, 2, 3, 4, 5, 6, 7}

Under the linked-list representation, each union(1, *) eventually rewrites every element's pointer to 1. The total work is 1 + 1 + 2 + 1 + 4 + 6 = 15 pointer updates for only 6 unions on 7 elements. Generalize: if you always union into the list containing element 1, the k-th union touches k pointers and the total cost is Theta(n^2).

Example 2 -- up-tree degeneration. Using an up-tree representation without heuristics, process union(1, 2), union(2, 3), union(3, 4), ..., union(n-1, n), always linking the second argument under the first. The resulting tree is a path:

1
\
2
\
3
\
.
\
n

find(n) now costs n - 1 pointer dereferences. A sequence of m find operations on n costs Theta(m * n) in the worst case. This is the failure mode union by rank (concept 06) is designed to eliminate.

Common Confusion / Misconceptions

"Naive union-find is linear total work across all unions." No. Each union is worst-case O(n), and the total over n - 1 unions can be Theta(n^2).

"find is free in the linked-list version." It is O(1) because each element directly stores a pointer to its list head, but the expense is pushed entirely into union. You cannot escape the total cost -- the heuristics balance it, not avoid it.

"Union-find can undo a merge if you keep enough metadata." Standard union-find cannot. Once classes are merged, splitting them requires a fundamentally different structure: link-cut trees (O(log n) per operation) or the Holm-Lichtenberg-Thorup fully dynamic connectivity result (O(log^2 n)). If your problem has delete, union-find is not the answer.

"All classes have the same representative before any unions." Classes are disjoint; initially each singleton is its own representative. The choice of representative changes only during union, and it must remain stable for subsequent find calls on elements of that class.

How To Use It

When you think you need union-find, check:

  1. is the equivalence relation only grown, never split?
    • union-find cannot efficiently undo a merge; for that, see link-cut trees
  2. do you actually need the class membership, or just the number of classes?
    • if only the count, keep a counter components that decrements on every successful union
  3. is the structure off-line or on-line?
    • on-line queries are answered as they arrive; off-line union-find can be even faster via Tarjan's off-line LCA algorithm
  4. do you need to carry class metadata (sum, min, max)?
    • store a small aggregate at the root and combine during union; find still returns the root
  5. is your workload dominated by find (then adopt path compression) or union (then union by rank/size)?

Transfer / Where This Shows Up Later

  • S5 (OS): file-system checkers (fsck) use union-find to group blocks into files during recovery; cgroup hierarchies in Linux use an analogous "parent class" abstraction.
  • S6 (databases): transaction isolation levels that detect conflict cycles in snapshot isolation can be phrased as union-find over read-write sets; the pattern recurs in deadlock detection graphs.
  • S7 (architecture): consistent-hashing rings partition tokens into "owner" groups; as nodes join/leave, token ranges coalesce and split -- a common generalization that reaches for union-find as the initial mental model.
  • S8 (scale): distributed gossip protocols (SWIM, Serf) use union-find-like structures to merge "confirmed alive" views across cluster members.

Check Yourself

  1. What three operations define the union-find API?
  2. Why is the linked-list representation's union worst case Theta(n)?
  3. Why can find still be O(1) in the linked-list version?
  4. Name three problems that reduce to union-find.
  5. What online / monotonicity property does union-find rely on, and what breaks if you try to support disunion?
  6. How do you maintain the count of distinct classes without scanning the array after each union?

Mini Drill or Application

  1. Start with singletons {1}, {2}, ..., {10}. Using the linked-list representation, simulate union(1,2), union(3,4), union(1,3), union(5,6), union(1,5), union(7,8), union(1,7), union(9,10), union(1,9). Record the number of pointer updates per union and the total.
  2. Construct a sequence of n - 1 unions that forces the naive up-tree representation into a path; give the exact find cost on the last element.
  3. Prove informally that on n elements no sequence of unions can cost more than O(n^2) in the naive representation.
  4. Implement the linked-list union-find in Python (~40 lines), then compare to O(n^2) by running n = 10^3, 10^4, 10^5 pathological unions and plotting time versus n.
  5. Extend your implementation to track per-class minimum as an aggregate; re-verify correctness on 20 random unions and find-min calls.

Read This Only If Stuck