Copy-on-Write File Systems: ZFS and Btrfs

What This Concept Is

A copy-on-write (COW) file system never overwrites live data. Every update writes new blocks in free space, builds new metadata pointing at the new blocks, and finally updates a single root pointer to publish the new state. Until that root write commits, the old version is intact.

 before update:

   superblock --> root (v1) --> metadata (v1) --> data blocks (v1)

 during update (writing new blocks alongside old):

   superblock --> root (v1) --> metadata (v1) --> data blocks (v1)
                                                   \
                                       new data --> data blocks (v2)
                                           \
                                  new metadata --> metadata (v2) -> (v2 data)
   (nothing references (v2) yet; safe to abandon on crash)

 after atomic root-pointer switch:

   superblock --> root (v2) --> metadata (v2) --> data blocks (v2)
   (old v1 chain is now orphaned; GC can reclaim it)
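
The three phases in the diagram can be sketched as a toy in-memory store. This is only an illustration under simplifying assumptions: CowStore, write_file, read_file, and the crash_before_publish flag are hypothetical names, the "metadata" is a single flat block, and nothing here reflects the on-disk layout of ZFS or Btrfs.

```python
class CowStore:
    """Toy COW store: a dict of immutable blocks plus one root pointer."""

    def __init__(self):
        self.blocks = {}   # block id -> contents; existing entries are never overwritten
        self.next_id = 0   # trivial allocator: always hands out a fresh id
        self.root = None   # stands in for the superblock's root pointer

    def _alloc(self, contents):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = contents
        return block_id

    def write_file(self, name, data, crash_before_publish=False):
        # Start from the currently published metadata (file name -> data block id).
        meta = dict(self.blocks[self.root]) if self.root is not None else {}
        meta[name] = self._alloc(data)                        # 1. new data block
        new_root = self._alloc(tuple(sorted(meta.items())))   # 2. new metadata block
        if crash_before_publish:
            return              # simulated crash: the new blocks stay unreferenced
        self.root = new_root    # 3. publish with a single pointer update

    def read_file(self, name, root=None):
        meta = dict(self.blocks[self.root if root is None else root])
        return self.blocks[meta[name]]


store = CowStore()
store.write_file("a.txt", b"v1")
v1_root = store.root
store.write_file("a.txt", b"v2", crash_before_publish=True)
after_crash = store.read_file("a.txt")             # the old version is still published
store.write_file("a.txt", b"v2")
current = store.read_file("a.txt")                 # the new version is now published
previous = store.read_file("a.txt", root=v1_root)  # the old chain is still readable
```

Holding on to v1_root is all it takes to keep reading the old version, which is the property snapshots build on.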

ZFS and Btrfs are the two most widely deployed COW file systems on Linux. Both are structured as Merkle trees: every parent block stores checksums of its children, so a reader that starts from a trusted root and verifies each block on the way down detects corruption anywhere below.
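
A minimal sketch of that parent-holds-child-checksum idea follows. The node layout (plain dicts) and the helper names checksum, serialize, make_leaf, make_parent, and verify are assumptions made for illustration, and SHA-256 is used only because it ships with Python, not because either file system defaults to it.

```python
import hashlib


def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def serialize(node) -> bytes:
    # A node's bytes cover its own data plus the checksums it stores for its children.
    return node["data"] + "".join(c for c, _ in node["children"]).encode()


def make_leaf(data: bytes):
    return {"data": data, "children": []}


def make_parent(children):
    # The parent records each child's checksum next to the pointer to that child.
    return {"data": b"", "children": [(checksum(serialize(c)), c) for c in children]}


def verify(node, expected: str) -> bool:
    """Check this node against the checksum held by its parent, then recurse."""
    if checksum(serialize(node)) != expected:
        return False
    return all(verify(child, stored) for stored, child in node["children"])


leaf_a, leaf_b = make_leaf(b"A"), make_leaf(b"B")
root = make_parent([leaf_a, leaf_b])
root_sum = checksum(serialize(root))   # would be stored in the superblock
ok_before = verify(root, root_sum)

leaf_b["data"] = b"X"                  # simulate silent corruption of one leaf
ok_after = verify(root, root_sum)
```

In this sketch the corrupted leaf is caught when its bytes are compared against the checksum stored in its parent, so the walk from the root reports the failure even though the root block itself is untouched.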

Why It Matters Here

COW replaces journaling with an atomic pointer switch. The invariant becomes trivial: "old or new, never partial." No commit blocks, no journal replay, and no post-crash fsck pass to restore structural consistency. The trade-offs move elsewhere:

Fragmentation: updates to a contiguous file scatter across free space; sequential reads slow down on HDDs.
Write amplification: updating one block in the middle of a file requires writing a new leaf, possibly a new indirect block, and a new root. On SSDs this is manageable; on small or low-endurance flash devices the extra writes cost both throughput and device lifetime.
Garbage collection: old versions must be reclaimed. Btrfs does this with reference-counted blocks; ZFS uses deferred free lists.
Feature density: because old blocks are never overwritten in place, snapshots are nearly free (just keep the old root); clones share blocks until mutated; end-to-end checksums are built in.
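
To put a rough number on the write-amplification point, the sketch below counts how many blocks a single-block update rewrites when every block on the path to the root is copied. The function name cow_blocks_written and the fanout of 128 pointers per block are assumptions chosen for illustration; real systems batch many updates into one transaction, which spreads the cost of the upper levels.

```python
def cow_blocks_written(data_blocks: int, fanout: int) -> int:
    """Blocks rewritten to update one data block: the block plus each ancestor."""
    if data_blocks < 1 or fanout < 2:
        raise ValueError("need at least one data block and a fanout of 2 or more")
    written = 1            # the modified data block itself
    level = data_blocks    # number of blocks at the current level
    while level > 1:
        level = -(-level // fanout)   # ceiling division: blocks one level up
        written += 1                  # one block on the path at that level
    return written


# A file of 1,000,000 data blocks with 128 pointers per indirect block:
# 1,000,000 -> 7,813 -> 62 -> 1 blocks per level, so the path has 4 blocks.
amplification = cow_blocks_written(1_000_000, 128)
```

Under these assumptions a one-block change turns into four block writes, which is the cost an update-in-place design avoids.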

COW is also the intellectual pattern behind LSM-tree storage engines, event-sourced systems, and persistent data structures in functional programming. If you internalize it here, it re-appears everywhere.

Concrete Example

Btrfs snapshots: the subvolume root is a node in a B-tree. Running btrfs subvolume snapshot records a new root that shares every block with the current live tree. Later writes go to fresh blocks, and only the modified paths diverge.

 time T1:
   live root R1 --- shares ----\
                                +--->  files A, B, C (shared blocks)
   snap root S1 -- shares ---/

 time T2 (A mutated through live tree):
   live root R2 --> A' (new block), B, C
   snap root S1 --> A, B, C     (unchanged)

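Deleting the snapshot does not touch the live file (now A'); it only decrements reference counts on the blocks the snapshot pointed to. B and C survive because the live tree still references them, while the old block A is freed because the snapshot was its last referrer. A block is freed only when no tree references it.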
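
The T1/T2 timeline and the reference-counting rule can be traced with a toy model. Volume and its methods are hypothetical names, each file is a single block, and the model bumps one counter per block at snapshot time purely for clarity; a real file system shares whole subtrees instead, which is what keeps snapshot creation cheap.

```python
from collections import Counter


class Volume:
    def __init__(self):
        self.refs = Counter()  # block id -> number of tree roots referencing it
        self.roots = {}        # root name -> {file name: block id}
        self.freed = []        # block ids handed back to the allocator
        self._next = 0

    def _alloc(self):
        block = self._next
        self._next += 1
        self.refs[block] = 1
        return block

    def _unref(self, block):
        self.refs[block] -= 1
        if self.refs[block] == 0:      # last referrer gone: the block is free
            del self.refs[block]
            self.freed.append(block)

    def create(self, root, names):
        self.roots[root] = {name: self._alloc() for name in names}

    def snapshot(self, src, dst):
        self.roots[dst] = dict(self.roots[src])   # share every block
        for block in self.roots[dst].values():
            self.refs[block] += 1

    def overwrite(self, root, name):
        old = self.roots[root][name]
        self.roots[root][name] = self._alloc()    # write goes to a fresh block
        self._unref(old)

    def delete_root(self, root):
        for block in self.roots.pop(root).values():
            self._unref(block)


vol = Volume()
vol.create("live", ["A", "B", "C"])   # T1: blocks 0, 1, 2
vol.snapshot("live", "snap")
vol.overwrite("live", "A")            # T2: live A becomes block 3
vol.delete_root("snap")               # only old A (block 0) loses its last referrer
```

After the last call, vol.freed holds only the old block for A, and B and C remain allocated with one reference each from the live tree.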

ZFS applies the same publish-by-pointer idea to its uberblock (the root pointer). Instead of overwriting a single uberblock, it writes each new one into the next of several slots. On mount, ZFS examines the slots and picks the uberblock with the highest transaction number whose checksum matches. If a crash happens mid-update, the partial uberblock fails its checksum, and ZFS falls back to the previous one.
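
The slot-selection rule can be sketched in a few lines. The 48-byte slot layout below (two little-endian 64-bit integers followed by a SHA-256 digest) and the names encode_slot, decode_slot, and pick_active are invented for this illustration and are not the ZFS on-disk format.

```python
import hashlib
import struct
from typing import List, Optional, Tuple

PAYLOAD_LEN = 16   # txg (8 bytes) + root pointer (8 bytes)
DIGEST_LEN = 32    # SHA-256 digest


def encode_slot(txg: int, root_ptr: int) -> bytes:
    payload = struct.pack("<QQ", txg, root_ptr)
    return payload + hashlib.sha256(payload).digest()


def decode_slot(raw: bytes) -> Optional[Tuple[int, int]]:
    """Return (txg, root_ptr), or None if the slot fails its checksum."""
    if len(raw) != PAYLOAD_LEN + DIGEST_LEN:
        return None
    payload, digest = raw[:PAYLOAD_LEN], raw[PAYLOAD_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        return None
    return struct.unpack("<QQ", payload)


def pick_active(slots: List[bytes]) -> Optional[Tuple[int, int]]:
    """Choose the valid slot with the highest transaction number."""
    valid = [d for d in map(decode_slot, slots) if d is not None]
    return max(valid, default=None)   # tuples compare by txg first


older = encode_slot(10, 100)
newer = encode_slot(11, 200)
# Simulate a torn write of txg 12: only the first 20 bytes reached the disk.
torn = encode_slot(12, 300)[:20] + b"\x00" * 28
active = pick_active([older, torn, newer])
```

Here the torn slot carries the highest transaction number but fails its checksum, so the selection falls back to transaction 11.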

Common Confusion / Misconception

"COW is a journal by another name." A journal logs intent then applies; COW writes new state in free space and publishes atomically. No replay, no double-write of data in the normal path.

"Btrfs/ZFS never need fsck." They need scrub, which reads everything and verifies checksums. For silent corruption (bit rot, misdirected write, failing cable), scrub is essential. Scrub is different from classical fsck because the structure is always consistent; scrub is about data integrity, not structural repair.

"Snapshots in COW systems take disk space proportional to their size." No. Snapshots take space proportional to the divergence between the snapshot and the current state. A snapshot of a 1 TiB subvolume occupies near-zero extra space until files change.

How To Use It

Given a workload, decide whether COW pays off:

Frequent snapshots, time-travel queries, or immutable versioning: strongly favor COW.
Large-file sequential workloads on HDD with high write throughput: lean toward journaled update-in-place; COW fragmentation can hurt.
Strong data-integrity requirements with no trust in hardware: favor COW with end-to-end checksums (ZFS especially).
Write-heavy small-random-update workload on SSD (typical database): journaled update-in-place is fine; COW's write amplification is unnecessary.

Check Yourself

Why is a snapshot in Btrfs essentially free at creation time?
How does the atomic root-pointer switch actually become atomic in hardware? (Hint: a single block write plus checksum.)
What does it mean for a COW FS to "write in free space," and why does it need a sophisticated allocator?

Mini Drill or Application

Trace through each scenario on Btrfs:

Create a file and its snapshot; overwrite one block in the live file. Which blocks are shared, which are not?
Delete the live file while the snapshot still exists. Do data blocks get freed? Explain via reference counting.
A power loss during the root write. Which state survives?

Then contrast with ext4 data=ordered for the same operations.
