Copy-on-Write and fork's Efficiency
What This Concept Is
Copy-on-write (CoW) is a virtual-memory trick that lets two address spaces share the same physical pages until one of them writes. The PTEs in both point to the same frame, both marked read-only even though the mapping is logically writable. On a write, the MMU raises a protection fault; the kernel allocates a new frame, copies the page, installs it in the faulting process's PTE as writable, and restarts the instruction.
When Linux does fork(), the kernel does not copy the parent's address space page by page. It duplicates the page-table hierarchy with every private writable PTE marked read-only, increments reference counts on every frame, and lets CoW handle actual divergence lazily.
This makes fork far cheaper than a full copy: its cost scales with the size of the page tables, not with the amount of data they map. The follow-up exec typically replaces the whole address space before much divergence happens, so most pages are never copied.
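The visible effect of fork can be checked from user space. The sketch below is a minimal illustration, assuming a Unix-like system where os.fork is available; fork_and_diverge is a hypothetical helper name. It shows only the observable isolation (the child's write does not reach the parent), not the lazy copying itself, which happens inside the kernel.

```python
import os

def fork_and_diverge():
    """Fork, let the child overwrite inherited data, and return
    (parent_view, child_view) of that data afterwards."""
    data = bytearray(b"parent")        # private memory, inherited by the child
    read_fd, write_fd = os.pipe()      # the child reports its view through this pipe

    pid = os.fork()
    if pid == 0:                       # child
        os.close(read_fd)
        data[:] = b"child!"            # the write lands in the child's own copy
        os.write(write_fd, bytes(data))
        os._exit(0)                    # exit without running the parent's cleanup

    os.close(write_fd)                 # parent
    child_view = os.read(read_fd, 16)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return bytes(data), child_view

if __name__ == "__main__":
    print(fork_and_diverge())
```

The parent's bytearray keeps its original contents even though the child rewrote its own view of the same virtual addresses.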
Why It Matters Here
Several widely used behaviors depend on CoW:
- fork + exec for starting new programs: cheap because of CoW.
- posix_spawn: implementations built on fork + exec rely on CoW; others, such as glibc's vfork-style clone, skip even the page-table copy.
- fork-based concurrency in Python, Ruby (MRI with forking servers), Redis (for background saves), PostgreSQL (worker processes), and some language runtimes: effective because CoW keeps shared code and data shared.
- madvise(MADV_DONTFORK) and madvise(MADV_WIPEONFORK): knobs that interact with CoW semantics.
Not understanding CoW leads to reasoning errors like: "I forked a 20 GiB process, so now I need 40 GiB of RAM." Usually not; you need 20 GiB + the divergence.
Concrete Example
Redis BGSAVE. A Redis process with 20 GiB of data forks a child. The child is responsible for writing an RDB snapshot to disk. The parent keeps serving commands. Both processes share the 20 GiB via CoW. As the parent writes to pages, those pages diverge. If the parent's writes touch 10% of the dataset's pages during the save, CoW produces at most 2 GiB of extra RSS. On a 24 GiB machine with 20 GiB of data, the 22 GiB peak fits.
Micro timeline.
parent PID=1000, RSS=20 GiB, writable pages all R/W
|
| fork()
v
parent PID=1000, RSS visible: 20 GiB (shared)
child PID=1001, RSS visible: 20 GiB (shared)
Every private writable page's PTE in both processes is now read-only + CoW.
Physical frames: single copy, refcount=2 each.
|
| parent writes page X
v
MMU: protection fault.
Kernel: allocate new frame X', copy X -> X', mark parent's PTE R/W to X',
decrement X's refcount. Child still maps X.
Reading is free. Both processes can read any page without any divergence. Only writes trigger the copy.
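The timeline above can be modelled in a few lines. This is a toy simulation, not kernel code: Frame, AddressSpace, and the two-element [frame, writable] PTE are hypothetical names chosen for illustration, and a real kernel tracks far more state.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    data: bytearray
    refcount: int = 1

@dataclass
class AddressSpace:
    ptes: dict = field(default_factory=dict)   # page number -> [frame, writable]
    faults: int = 0                            # simulated protection faults

    def map_page(self, vpn, data):
        self.ptes[vpn] = [Frame(bytearray(data)), True]

    def fork(self):
        child = AddressSpace()
        for vpn, pte in self.ptes.items():
            pte[1] = False                     # parent's PTE becomes read-only
            pte[0].refcount += 1
            child.ptes[vpn] = [pte[0], False]  # child maps the same frame
        return child

    def read(self, vpn, offset):
        return self.ptes[vpn][0].data[offset]  # reads never fault

    def write(self, vpn, offset, value):
        pte = self.ptes[vpn]
        if not pte[1]:                         # write to a read-only PTE: fault
            self.faults += 1
            frame = pte[0]
            if frame.refcount > 1:             # still shared: copy into a new frame
                frame.refcount -= 1
                pte[0] = Frame(bytearray(frame.data))
            pte[1] = True                      # sole owner: just re-enable writes
        pte[0].data[offset] = value

if __name__ == "__main__":
    parent = AddressSpace()
    parent.map_page(0, b"abcd")
    child = parent.fork()
    parent.write(0, 0, ord("Z"))
    print(chr(parent.read(0, 0)), chr(child.read(0, 0)), parent.faults, child.faults)
```

The model also shows a detail the timeline leaves implicit: once the parent has taken its copy, the original frame has a single owner, so a later write by the child faults but needs no copy.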
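Verifying CoW. Check /proc/$pid/smaps for Shared_Clean, Shared_Dirty, Private_Clean, Private_Dirty. Before divergence, most of the child's inherited mappings are counted under the Shared_* fields: anonymous memory such as the heap typically under Shared_Dirty, untouched file-backed pages under Shared_Clean. As pages diverge, they move to Private_Dirty.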
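One way to track these four counters over time is to sum them across all mappings. The sketch below assumes the usual "Name:   N kB" line layout of smaps; sum_smaps and smaps_totals are hypothetical helper names.

```python
import re

FIELDS = ("Shared_Clean", "Shared_Dirty", "Private_Clean", "Private_Dirty")
_LINE = re.compile(r"^(\w+):\s+(\d+) kB$")

def sum_smaps(text):
    """Sum the four sharing counters (in kB) over every mapping in smaps text."""
    totals = dict.fromkeys(FIELDS, 0)
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group(1) in totals:
            totals[match.group(1)] += int(match.group(2))
    return totals

def smaps_totals(pid="self"):
    """Read /proc/<pid>/smaps (Linux only) and return the summed counters."""
    with open(f"/proc/{pid}/smaps") as f:
        return sum_smaps(f.read())

if __name__ == "__main__":
    print(smaps_totals())
```

Calling smaps_totals on the child's PID before and after it writes gives a before/after pair that makes the shift toward Private_Dirty easy to see.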
Common Confusion / Misconception
"fork doubles memory usage." Only in the worst case, where every page gets written by at least one side before the child exits. In practice, fork + exec is close to free, and server forks with high shared-data read patterns stay near the parent's RSS for a long time.
"CoW means the parent and child are really the same process." They are independent processes with independent PTEs and independent page tables. Only the backing frames are shared, and only until divergence.
"All pages are CoW after fork." Not quite. Pages in MAP_SHARED mappings stay shared across fork; writes to them are visible to both processes and never trigger CoW. MAP_ANONYMOUS | MAP_PRIVATE pages are the classic CoW case. File-backed MAP_PRIVATE pages behave similarly (CoW on write, with the clean version coming from the page cache).
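The MAP_SHARED versus MAP_PRIVATE difference is easy to observe with an anonymous mapping. This is a minimal sketch, assuming a Unix-like system with os.fork and Python's mmap module; child_write_visible is a hypothetical helper name.

```python
import mmap
import os

def child_write_visible(flags):
    """Return the parent's view of byte 0 after a forked child writes there."""
    region = mmap.mmap(-1, mmap.PAGESIZE, flags=flags | mmap.MAP_ANONYMOUS)
    region[0] = ord("P")               # value set by the parent before fork
    pid = os.fork()
    if pid == 0:                       # child
        region[0] = ord("C")           # shared: same frame; private: CoW copy
        os._exit(0)
    os.waitpid(pid, 0)                 # parent waits, then looks at the byte
    seen = chr(region[0])
    region.close()
    return seen

if __name__ == "__main__":
    print("MAP_SHARED: ", child_write_visible(mmap.MAP_SHARED))
    print("MAP_PRIVATE:", child_write_visible(mmap.MAP_PRIVATE))
```

With MAP_SHARED the parent sees the child's write; with MAP_PRIVATE the child's write goes to its own copy and the parent's byte is unchanged.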
How To Use It
When debugging memory in a forked process:
- Look at /proc/$pid/smaps for Shared_* vs Private_* numbers. A big Private_Dirty in the child indicates heavy divergence.
- For fork-heavy servers, test MADV_WIPEONFORK for sensitive pages and MADV_DONTFORK for huge, child-irrelevant regions.
- For Redis-like snapshot forks, size your host RAM as dataset + expected_write_rate * snapshot_duration, not 2 * dataset.
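The sizing rule in the last bullet can be written as a small estimator. This is a back-of-the-envelope sketch: snapshot_ram_gib and its parameter names are hypothetical, and the rate is assumed to mean memory in newly dirtied pages per second, not bytes written (a one-byte write still copies a whole page).

```python
def snapshot_ram_gib(dataset_gib, dirty_rate_gib_per_s, snapshot_s):
    """Estimate peak RAM during a snapshot fork: dataset plus CoW divergence.

    Divergence is capped at the dataset size because each shared page
    is copied at most once.
    """
    divergence = min(dirty_rate_gib_per_s * snapshot_s, dataset_gib)
    return dataset_gib + divergence

if __name__ == "__main__":
    # Illustrative numbers only: 20 GiB dataset, 0.125 GiB/s dirtied, 16 s snapshot.
    print(snapshot_ram_gib(20, 0.125, 16))
```

With these made-up inputs the estimate is 22 GiB, matching the 20 GiB dataset with 10% divergence from the Redis example; the cap is what keeps the estimate from ever exceeding 2 * dataset.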
When profiling fork cost, remember that the apparent cost is dominated by the page-table duplication and by the first few writes, not by any whole-address-space copy.
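A rough way to probe this from user space is to time the fork call itself at different resident sizes. The sketch below assumes a Unix-like system; time_fork is a hypothetical helper, and the numbers it produces vary with the allocator, transparent huge pages, and system load, so treat them as indicative only.

```python
import os
import time

def time_fork(resident_bytes):
    """Make `resident_bytes` of memory resident, then return the seconds
    one fork() takes as seen by the parent."""
    page = os.sysconf("SC_PAGE_SIZE")
    buf = bytearray(resident_bytes)            # zero-filled allocation
    for i in range(0, resident_bytes, page):   # touch each page so it is mapped
        buf[i] = 1
    start = time.perf_counter()
    pid = os.fork()
    if pid == 0:
        os._exit(0)                            # child exits at once, writes nothing
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    return elapsed

if __name__ == "__main__":
    for mib in (16, 64, 256):
        print(f"{mib} MiB resident: fork took {time_fork(mib << 20) * 1e6:.0f} us")
```

Because the child exits without writing, whatever time is measured comes from setting up the child (including page-table duplication), not from copying data pages.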
Check Yourself
- What does the MMU see on the first write to a CoW page, and what does the kernel do in response?
- Why is fork cheap even for a 40 GiB process?
- What is the difference between Shared_Dirty and Private_Dirty in /proc/$pid/smaps?
- Why does exec after fork usually make the CoW cost close to zero?
Mini Drill or Application
- A parent has 4 GiB of heap. It forks. The child reads the entire heap but writes nothing. Expected RSS usage? Expected minor faults in the child?
- Same parent. Child writes one byte in every page. Expected extra RSS? Expected fault count?
- On a Linux system, fork a process and examine /proc/$child/smaps before and after the child writes into half of its inherited heap. Summarize what moves from Shared to Private.
- Why does CoW interact with transparent huge pages (THP)? (Hint: writing one byte into a 2 MiB page.)
- Describe a pathological workload where fork + CoW turns into a near-worst-case full copy.