Module 2: Memory Management & Virtual Memory: Case Studies

These case studies make virtual memory operational: page tables, TLBs, faults, copy-on-write, huge pages, allocators, and thrashing.

Case Study 1: Page Fault Storm After Deploy

Scenario: A service starts slowly after a deploy, and major page faults spike. The new container image changed memory access patterns, so cold starts now touch a much larger working set.

Source anchor: Linux kernel Page Tables explains hierarchical virtual-to-physical address translation.

Module concepts: page table, demand paging, major fault, working set.

Wrong Approach

"A page fault is always a crash."

Better Approach

Classify faults:

minor fault:
  page is already in RAM (or only a fresh zeroed page is needed); the kernel just updates the page table, no disk I/O

major fault:
  requires disk or backing store I/O

fix:
  reduce cold working set, warm critical paths, adjust image/data layout
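
The classification above can be turned into numbers for the diagnosis. The following is a minimal sketch, assuming a Unix system where Python's resource module is available; measure_faults and touch_pages are hypothetical helper names, and the 16 MiB buffer and 4 KiB page size are illustrative assumptions rather than values from the scenario.

```python
import resource


def fault_counts():
    """Return (minor, major) page-fault totals for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def measure_faults(work):
    """Run work() and report how many minor/major faults it triggered."""
    minor_before, major_before = fault_counts()
    work()
    minor_after, major_after = fault_counts()
    return {
        "minor": minor_after - minor_before,
        "major": major_after - major_before,
    }


def touch_pages(size_bytes=16 * 1024 * 1024, page_size=4096):
    """Stand-in for a cold path: allocate a buffer and write one byte per page."""
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1
    return len(buf)


if __name__ == "__main__":
    # Replace touch_pages with the real startup step under investigation.
    print(measure_faults(touch_pages))
```

Wrapping each startup step this way shows which step contributes the major faults, which is the cold path worth warming or shrinking.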

Tradeoff Table

Choice	Gain	Cost
accept larger cold start	no engineering work	slower deploy recovery
warm critical paths	faster readiness	extra startup work
reduce working set	fewer major faults	possible feature/layout changes

Failure Mode

The deploy changes which pages are touched during startup, so the process triggers a burst of major faults while demand-paging a much larger cold working set.

Project / Capstone Connection

Use this when diagnosing slow startup in containerized services, ML inference apps, or student deployments that load large assets on boot.

Required Artifact

Write a fault diagnosis: minor/major counts, working set, cold path, and mitigation.

Case Study 2: TLB Pressure And Huge Pages

Scenario: An in-memory analytics process scans a 60 GB column store. CPU usage is high but memory bandwidth is not saturated. TLB misses dominate.

Source anchor: Linux Transparent Hugepage Support explains using larger pages to reduce translation overhead.

Module concepts: TLB, page size, huge pages, locality.

Wrong Approach

"More RAM fixes memory performance."

Better Approach

Look at translation cost:

4 KiB pages:
  many translations

huge pages:
  fewer TLB entries for same memory

risk:
  fragmentation and latency spikes from compaction
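
The translation cost can be estimated with simple arithmetic. This sketch treats the 60 GB store as 60 GiB for round numbers, assumes 2 MiB huge pages (the common x86-64 size), and uses a hypothetical TLB of 1536 entries; check the real entry count for the target CPU before relying on the result.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(data_bytes, page_bytes):
    """Number of pages required to map data_bytes (rounded up)."""
    return -(-data_bytes // page_bytes)


def tlb_reach(entries, page_bytes):
    """Bytes of memory the TLB can cover without a miss."""
    return entries * page_bytes


if __name__ == "__main__":
    data = 60 * GIB        # assumed size of the column store
    tlb_entries = 1536     # hypothetical TLB size, not from the scenario
    for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB)):
        print(
            f"{label} pages: {pages_needed(data, page):,} translations, "
            f"TLB reach {tlb_reach(tlb_entries, page) / MIB:,.0f} MiB"
        )
```

Under these assumptions the TLB covers 6 MiB with 4 KiB pages and 3 GiB with 2 MiB pages, which is why a large scan can stall on translation long before memory bandwidth is saturated.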

Tradeoff Table

Choice	Gain	Cost
stay on 4 KiB pages	predictable behavior	more TLB pressure
enable huge pages	fewer translations	fragmentation risk
selective huge-page use	targeted gain	operational complexity

Failure Mode

The workload fits in memory, but address translation becomes the bottleneck because the TLB cannot cover enough of the active scan efficiently.

Project / Capstone Connection

This is relevant for analytics, search, or simulation capstones that scan large in-memory structures and need to distinguish bandwidth limits from translation limits.

Required Artifact

Write a huge-page decision note with workload, TLB metric, expected gain, and rollback.

Case Study 3: Copy-On-Write Surprise After Fork

Scenario: A server forks workers after loading a large model. Memory looks shared at first, then each worker mutates global caches and RSS grows sharply.

Source anchor: Linux fork(2) documentation explains child creation; copy-on-write is the core virtual-memory mechanism that makes fork cheap.

Module concepts: fork, copy-on-write, RSS, shared pages.

Wrong Approach

"Forked workers share all memory forever."

Better Approach

Keep shared pages read-only where possible:

load model before fork
avoid post-fork writes to shared structures
give each worker its own mutable caches by design
measure RSS/PSS
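
To see why PSS is worth measuring alongside RSS, the effect of one worker writing to shared pages can be modeled with arithmetic. This is a simplified sketch, not kernel accounting code: pss_after_write is a hypothetical helper, it assumes every process maps the same preloaded pages, that exactly one process writes, and that each written page gets one private copy.

```python
def pss_after_write(total_pages, num_procs, written_pages):
    """Model PSS (in pages) after one process writes to written_pages pages.

    Returns (writer_pss, other_pss, physical_pages).
    Requires num_procs >= 2.
    """
    still_shared = total_pages - written_pages
    # The writer owns private copies of the pages it wrote, plus its share
    # of the pages that all processes still share.
    writer_pss = written_pages + still_shared / num_procs
    # The original copies of the written pages are now shared only by the
    # remaining processes.
    other_pss = written_pages / (num_procs - 1) + still_shared / num_procs
    # Physical memory grows by one page per private copy.
    physical_pages = total_pages + written_pages
    return writer_pss, other_pss, physical_pages


if __name__ == "__main__":
    # Hypothetical numbers: 1000 preloaded pages, 4 processes, 100 pages written.
    writer, other, physical = pss_after_write(1000, 4, 100)
    print(f"writer PSS: {writer:.1f} pages")
    print(f"other PSS:  {other:.1f} pages")
    print(f"physical:   {physical} pages")
```

In this model each process still reports 1000 pages of RSS before and after the write, while the PSS values and the physical total reveal the lost sharing.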

Tradeoff Table

Choice	Gain	Cost
fork after preload	strong initial sharing	sensitive to later writes
mutate shared caches post-fork	simpler code reuse	RSS growth
isolate mutable state per worker	predictable memory	more duplication

Failure Mode

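Workers begin writing to previously shared pages. Each first write triggers a private copy of the page, so sharing erodes and the physical memory consumed across workers grows toward one full copy per process. PSS shows this more clearly than RSS, because RSS counts shared pages in every process that maps them.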

Project / Capstone Connection

Use this when scaling prefork servers, inference workers, or data-processing daemons that load large common state before forking.

Required Artifact

Draw a COW page diagram before and after one worker writes.

Case Study 4: Allocator Fragmentation In A Long-Running Service

Scenario: A service allocates many varied-size objects. Heap usage grows even after request load drops.

Source anchor: malloc behavior is implementation-specific; use mallopt(3) and allocator docs to ground the investigation.

Module concepts: allocator, fragmentation, arenas, RSS, heap profiling.

Wrong Approach

"If objects are freed, memory returns to the OS immediately."

Better Approach

Profile allocation shape:

size classes:
allocation lifetime:
thread arenas:
fragmentation:
RSS vs live heap:

Tradeoff Table

Choice	Gain	Cost
ignore fragmentation	no immediate effort	persistent RSS growth
tune allocator behavior	lower waste	platform-specific tuning
reshape allocation patterns	durable fix	code changes

Failure Mode

Freed objects leave holes across arenas and size classes, so live heap drops while RSS stays high and reuse becomes inefficient.
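
The gap between live heap and held memory can be illustrated with a toy model. This sketch assumes fixed-size objects packed into 4 KiB pages and assumes a page can be released only when every object on it is free; real allocators are more sophisticated, and simulate is a hypothetical helper, not an allocator API.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
OBJ_SIZE = 64     # hypothetical fixed object size in bytes


def simulate(num_objects, keep_every):
    """Allocate num_objects contiguously, then free all but every keep_every-th.

    Returns (live_bytes, held_bytes), where held_bytes counts pages that
    still contain at least one live object.
    """
    objects_per_page = PAGE_SIZE // OBJ_SIZE
    live = [i for i in range(num_objects) if i % keep_every == 0]
    pinned_pages = {i // objects_per_page for i in live}
    live_bytes = len(live) * OBJ_SIZE
    held_bytes = len(pinned_pages) * PAGE_SIZE
    return live_bytes, held_bytes


if __name__ == "__main__":
    live, held = simulate(num_objects=6400, keep_every=64)
    print(f"live heap: {live} bytes, held pages: {held} bytes, "
          f"ratio: {held / live:.0f}x")
```

With one survivor per page, live heap falls to 6,400 bytes while all 100 pages (409,600 bytes) stay pinned, which is the pattern to look for when comparing RSS against a heap profiler's live-bytes figure.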

Project / Capstone Connection

This fits long-running APIs, brokers, or game servers in capstones where memory growth appears long after the triggering request burst.

Required Artifact

Write an allocator investigation note with allocation profile and mitigation.

Case Study 5: Thrashing From Oversubscribed Memory

Scenario: Five services each fit in memory on their own but thrash when colocated. CPU utilization drops, disk I/O rises, and latency explodes.

Source anchor: Linux memory-management and cgroup docs explain memory pressure and limits. See Linux cgroup v2 memory controller.

Module concepts: working set, thrashing, swapping, cgroup memory, OOM.

Wrong Approach

"CPU is low, so the service is not busy."

Better Approach

Treat memory as the bottleneck:

working set total:
  exceeds RAM

symptoms:
  major faults, swap I/O, reclaim, OOM kills

fix:
  reduce colocated workload or set memory limits
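
One way to put a number on the symptoms is Linux pressure stall information (PSI), exposed at /proc/pressure/memory and in a cgroup v2 memory.pressure file on kernels built with PSI support. The sketch below parses that format; the sample values and the 10% threshold are illustrative assumptions, and parse_psi and looks_like_thrashing are hypothetical helper names.

```python
def parse_psi(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=") for field in fields)
        }
    return result


def looks_like_thrashing(psi, threshold_pct=10.0):
    """Flag sustained stalls: 'full' means all non-idle tasks were stalled.

    threshold_pct is an arbitrary illustrative cutoff, not a kernel default.
    """
    return psi["full"]["avg10"] >= threshold_pct


def read_memory_pressure(path="/proc/pressure/memory"):
    """Read and parse live PSI data (requires a kernel with PSI enabled)."""
    with open(path) as handle:
        return parse_psi(handle.read())


if __name__ == "__main__":
    # Made-up sample in the PSI line format, for demonstration only.
    sample = (
        "some avg10=42.00 avg60=30.50 avg300=12.00 total=123456789\n"
        "full avg10=35.00 avg60=22.25 avg300=8.00 total=98765432\n"
    )
    psi = parse_psi(sample)
    print(psi["full"]["avg10"], looks_like_thrashing(psi))
```

Recording these values per service before and after colocation gives the pressure report a measurement to set next to the working-set estimates and the cgroup limit.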

Tradeoff Table

Choice	Gain	Cost
colocate all services	high utilization on paper	thrashing risk
set memory limits	isolation	earlier OOM or eviction
spread services across nodes	stable latency	infrastructure cost

Failure Mode

The combined working sets exceed available RAM, so reclaim and swap dominate execution and useful CPU work collapses.

Project / Capstone Connection

Apply this when packing multiple student services onto a shared VM or Kubernetes node and deciding whether memory isolation or placement must change.

Required Artifact

Create a memory pressure report with working set, faults, swap/reclaim, cgroup limit, and placement decision.

Source Map

Source	Use it for
Linux Page Tables	virtual-to-physical translation
Transparent Hugepage Support	huge pages and TLB pressure
fork(2)	fork and COW context
mallopt(3)	allocator tuning surface
Linux cgroup v2	memory control and pressure

Completion Standard

At least three artifacts are completed.
At least one artifact walks address translation or COW.
At least one artifact diagnoses faults or memory pressure.
