Skip to main content

Persistent Data Structures and Functional Spines

What This Concept Is

A persistent data structure preserves its previous versions when modified. Every update returns a new version; the old version is still queryable. Three flavors:

  • Partially persistent: old versions are read-only; only the latest can be updated
  • Fully persistent: any version can be queried and updated (updates fork the version DAG)
  • Confluently persistent: two versions can be merged (rare; needed for some functional-programming primitives and for VCS merge semantics)

The canonical implementation technique is path copying: when modifying a node in a tree, copy every ancestor along the path from root to that node, reusing all other subtrees by pointer. Memory and time per update become O(log n) instead of O(n) -- far cheaper than copying the entire structure. Because non-modified subtrees are shared across versions via pointer aliasing, the total memory across m versions of a size-n balanced tree is O(n + m log n), not O(m n).

An alternative technique, the fat-node method, stores multiple version stamps at each node. Driscoll, Sarnak, Sleator, and Tarjan (1989) showed that combining fat nodes with a "node splitting" rule gives O(1) amortized overhead for partial persistence and O(log log m) for full persistence -- a tighter bound than path copying but more complex to implement.

Persistent structures are the native representation of data in pure functional languages (Haskell, OCaml, Clojure, Elm). They are also used in version-control systems (Git's commit tree), undo/redo stacks, databases that need snapshot isolation or time-travel queries (CouchDB, Datomic, Git-as-database), and copy-on-write file systems (ZFS, Btrfs, APFS).

Why It Matters Here

Persistence is the fifth distinct technique in this module for controlling data-structure behavior:

  • invariant-driven rebalancing (red-black, AVL)
  • randomization (skip lists, treaps)
  • amortized accounting (Fibonacci heaps, splay trees)
  • probabilistic compression (Bloom filters)
  • structural sharing across versions (persistence)

It also reveals that the shape of the tree matters for more than just search: a balanced tree with O(log n) depth is what makes path-copying cheap. A degenerate BST would cost O(n) per persistent update and O(n^2) memory over n updates -- a good reminder that data-structure choice and persistence interact.

Persistence is the practical bridge between "in-memory data structures" and "immutable distributed storage": Git, Docker image layers, CRDTs, and blockchain state tries (Ethereum's MPT) are all persistent trees in disguise.

Concrete Examples

Example 1 -- persistent BST insert. A persistent binary search tree storing {2, 5, 7}:

Version 0 (v0):
5
/ \
2 7

Now insert 6 to produce v1. The new node 6 is a left child of 7. Path copying creates:

  • a new node 7' with children (left = 6, right = NIL)
  • a new root 5' with left = old 2, right = 7'
v0:           v1:
5 5'
/ \ / \
2 7 2 7'
/
6

The subtree rooted at 2 is shared between v0 and v1. Only the path from root to the insertion point was copied -- O(log n) new nodes.

Querying search(v0, 7) still returns the original 7 node; search(v1, 7) returns 7', which also has 6 as a left child.

Example 2 -- persistent array via balanced 32-ary tree (Clojure vector). A persistent vector of 1,000,000 elements is stored as a log_32(10^6) = 4-level tree of 32-ary branching. Updating index i requires copying only 4 nodes along the path from root to leaf.

Memory cost per update: 4 * 32 * 8 bytes ≈ 1 KB. Query cost: 4 pointer dereferences, each of which likely hits L1/L2 cache. Because log_32 is so shallow, Clojure markets this as "effectively O(1)" even though it is technically O(log n).

Two versions v1 and v2 differing in one element share all but 4 nodes -- roughly 99.99999% memory reuse. This is why immutable Clojure maps and vectors are practical in production: the constant factor of path copying is tiny when the tree is wide.

Common Confusion / Misconceptions

"Persistent means written to disk." Not here. In the data-structures literature, persistent means "preserving history in memory via structural sharing." Disk persistence is a separate concept (durability). Conflating the two is the single biggest source of confusion when discussing Datomic or Git.

"Persistence is free / cheap." Every update costs O(log n) extra time and space to copy the path, versus O(log n) time and O(1) extra space for an ephemeral (destructive) update. The benefit is the old version remains valid; the cost is memory per update. For workloads without a need for history, a mutable structure is strictly better.

"Path copying needs garbage collection." It is natural in garbage-collected languages, but not required. Rust, C++, and Swift implement persistent structures with reference counting (Arc<Node>, shared_ptr<Node>); old versions are freed when no one holds them.

"Path copying is the only way." The fat-node method, the node-copying method, and the version-DAG method all achieve persistence with different amortized bounds (DSST 1989). Pick path copying when you value simplicity; pick fat nodes when you need O(1) amortized overhead.

How To Use It

Path copying recipe for a persistent BST:

persistent_insert(root, key):
if root is NIL:
return new node(key)
if key < root.key:
new_left = persistent_insert(root.left, key)
return new node(root.key, new_left, root.right)
else:
new_right = persistent_insert(root.right, key)
return new node(root.key, root.left, new_right)

Every recursive call allocates at most one new node. Total allocations per insert: O(h) = O(log n) if the tree is balanced.

Other techniques:

  • fat-node method: each node stores multiple versions; O(1) extra per update but complex queries
  • copy-on-write pages: used in file systems (ZFS, Btrfs, APFS) and modern databases to achieve partial persistence at page granularity
  • functional arrays and vectors (Clojure, Scala, Immer): trees of branching factor 32 for O(log_32 n) ≈ effectively O(1) practical persistence
  • persistent segment trees: used in offline range-query problems with version rollbacks; standard in competitive programming

When to use persistent structures:

  • undo/redo history, time-travel debugging
  • multi-version concurrency control (MVCC) in databases
  • functional-programming immutable collections
  • version-control systems (Git, Fossil, Mercurial)
  • snapshot isolation in file systems (ZFS, Btrfs)
  • interview-style graph/tree algorithms that need to "rewind" state
  • research problems requiring history, such as geometric point location queries (Driscoll-Sarnak-Sleator-Tarjan 1989)

Transfer / Where This Shows Up Later

  • S5 (OS): ZFS, Btrfs, and APFS are copy-on-write filesystems -- the "snapshot" feature is nothing but a pointer to an old version of the block tree. Understanding persistent trees demystifies what a filesystem snapshot actually is in memory.
  • S6 (databases): MVCC in PostgreSQL and Oracle preserves row versions for concurrent transactions; Datomic and XTDB are built on append-only, fully persistent indexes; CouchDB stores every revision in a persistent B-tree.
  • S7 (architecture): ADRs for "how should we handle audit logging and time-travel queries?" land directly on persistent trees or event sourcing; Git's branch/merge model is confluent persistence in practice.
  • S8 (scale): Ethereum's Merkle Patricia Trie and IPFS's Merkle DAG are persistent trees that also carry hash-based integrity; CRDTs and operational transformation in collaborative editors (Figma, Google Docs) rely on persistent data structures for conflict-free merges.

Check Yourself

  1. What are partial, full, and confluent persistence?
  2. Why does path copying cost O(log n) per update on a balanced tree?
  3. Why does persistence compose well with immutability in functional languages?
  4. What goes wrong with persistence on a degenerate BST?
  5. Why is a 32-ary branching factor used in Clojure's persistent vectors instead of 2?
  6. How does a Git commit form a partially persistent tree, and why is it not fully persistent by default?

Mini Drill or Application

  1. Draw the version DAG after v0 = {}, v1 = insert(v0, 5), v2 = insert(v1, 3), v3 = insert(v1, 7). Show which nodes are shared.
  2. In v3 above, which subtrees are shared with v1? With v2?
  3. Write a short paragraph about how Git commits form a partially persistent tree of snapshots, and where confluent persistence (merges) enters.
  4. Explain why v1.insert(k) under path copying does not affect v0.
  5. Implement a persistent singly-linked stack in your language of choice: push(v, x) returns a new version without modifying v. Verify that peek(v) remains correct after many pushes on derived versions.

Read This Only If Stuck

There are no dedicated persistent-data-structures chunks in the local book library; the closest reinforcements are the BST and balanced-tree chunks, which establish the structural vocabulary persistence relies on. External references include Okasaki's Purely Functional Data Structures (1999) and the seminal Driscoll-Sarnak-Sleator-Tarjan paper (1989).