
Module 2: Memory Management & Virtual Memory: Case Studies

These case studies make virtual memory operational: page tables, TLBs, faults, copy-on-write, huge pages, allocators, and thrashing.


Case Study 1: Page Fault Storm After Deploy

Scenario: A service starts slowly after a deploy and major page faults spike. The new container image changed the memory access pattern, so cold starts now touch a much larger working set.

Source anchor: The Linux kernel Page Tables documentation explains hierarchical virtual-to-physical address translation.

Module concepts: page table, demand paging, major fault, working set.

Wrong Approach

"A page fault is always a crash."

Better Approach

Classify faults:

minor fault:
page is already in RAM (for example in the page cache) and only needs to be mapped; no disk I/O

major fault:
requires disk or backing store I/O

fix:
reduce cold working set, warm critical paths, adjust image/data layout
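The classification above can be checked against real counters. The sketch below is a minimal example, assuming Linux and the `/proc/<pid>/stat` layout described in proc(5), where `minflt` is field 10 and `majflt` is field 12. The helper names are illustrative, not part of any library.

```python
def parse_fault_counts(stat_line: str) -> dict:
    """Extract minor/major fault counters from one /proc/<pid>/stat line."""
    # The comm field (field 2) may contain spaces or ')', so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); minflt is field 10 and majflt is field 12.
    return {"minor": int(rest[7]), "major": int(rest[9])}


def fault_delta(before: dict, after: dict) -> dict:
    """Faults that occurred between two samples, plus the major-fault share."""
    minor = after["minor"] - before["minor"]
    major = after["major"] - before["major"]
    total = minor + major
    share = major / total if total else 0.0
    return {"minor": minor, "major": major, "major_share": share}


def read_fault_counts(pid: int) -> dict:
    """Read the live counters for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_fault_counts(f.read())
```

Sampling `read_fault_counts(pid)` just before and just after startup, then passing both samples to `fault_delta`, gives the minor/major split for the cold path. A high major share points at disk-backed demand paging rather than ordinary first-touch faults.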

Tradeoff Table

Choice | Gain | Cost
accept larger cold start | no engineering work | slower deploy recovery
warm critical paths | faster readiness | extra startup work
reduce working set | fewer major faults | possible feature/layout changes

Failure Mode

The deploy changes which pages are touched during startup, so the process triggers a burst of major faults while demand-paging a much larger cold working set.
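One way to warm a critical path is to touch the needed file pages before the service reports ready, so the major faults happen during warm-up instead of on the first request. This is a minimal sketch, assuming the hot data lives in a regular file; `warm_file` is a hypothetical helper, and warmed pages can still be evicted later under memory pressure.

```python
import mmap
import os


def warm_file(path: str) -> int:
    """Touch one byte per page of a file so its pages are faulted in now.

    Returns the number of pages touched.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0  # mmap cannot map an empty file
    touched = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for offset in range(0, size, mmap.PAGESIZE):
                _ = m[offset]  # reading a byte forces the page to be resident
                touched += 1
    return touched
```

The cost is the "extra startup work" row in the tradeoff table: warm-up time grows with the number of pages touched, so warm only the assets on the critical path.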

Project / Capstone Connection

Use this when diagnosing slow startup in containerized services, ML inference apps, or student deployments that load large assets on boot.

Required Artifact

Write a fault diagnosis: minor/major counts, working set, cold path, and mitigation.


Case Study 2: TLB Pressure And Huge Pages

Scenario: An in-memory analytics process scans a 60 GB column store. CPU usage is high but memory bandwidth is not saturated. TLB misses dominate.

Source anchor: The Linux Transparent Hugepage Support documentation explains how larger pages reduce translation overhead.

Module concepts: TLB, page size, huge pages, locality.

Wrong Approach

"More RAM fixes memory performance."

Better Approach

Look at translation cost:

4 KiB pages:
many translations

huge pages:
fewer TLB entries for same memory

risk:
fragmentation and latency spikes from compaction
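The translation cost can be made concrete with a little arithmetic. The sketch below assumes the 60 GB store is 60 GiB, a 2 MiB huge page size (typical on x86-64), and a hypothetical TLB with 1536 entries. Real CPUs differ and often have separate entry counts per page size.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(data_bytes: int, page_bytes: int) -> int:
    """Number of pages (and therefore distinct translations) covering the data."""
    return -(-data_bytes // page_bytes)  # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory the TLB can translate without a miss."""
    return entries * page_bytes


DATA = 60 * GIB        # the scenario's column store, read as 60 GiB
TLB_ENTRIES = 1536     # hypothetical entry count; check your CPU's documentation

small_pages = pages_needed(DATA, 4 * KIB)    # 15,728,640 translations
huge_pages = pages_needed(DATA, 2 * MIB)     # 30,720 translations

small_reach = tlb_reach(TLB_ENTRIES, 4 * KIB)   # 6 MiB covered
huge_reach = tlb_reach(TLB_ENTRIES, 2 * MIB)    # 3 GiB covered
```

Under these assumptions the same TLB covers 512 times more memory with huge pages, which is why a scan can be translation-bound while memory bandwidth still has headroom.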

Tradeoff Table

Choice | Gain | Cost
stay on 4 KiB pages | predictable behavior | more TLB pressure
enable huge pages | fewer translations | fragmentation risk
selective huge-page use | targeted gain | operational complexity

Failure Mode

The workload fits in memory, but address translation becomes the bottleneck because the TLB cannot cover enough of the active scan efficiently.
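Before deciding, it helps to record what the system is doing today. On Linux, `/sys/kernel/mm/transparent_hugepage/enabled` marks the active mode in brackets, and `/proc/meminfo` reports `AnonHugePages`. The parsers below are a small sketch with illustrative function names; they take file contents as strings so they can be run against captured output.

```python
def parse_thp_mode(text: str) -> str:
    """Return the active mode from a sysfs line such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode marked in {text!r}")


def anon_huge_kib(meminfo_text: str) -> int:
    """Return the AnonHugePages value (in the kB units meminfo uses)."""
    for line in meminfo_text.splitlines():
        if line.startswith("AnonHugePages:"):
            return int(line.split()[1])
    return 0  # field absent, for example when THP is not built in
```

Recording the mode and the `AnonHugePages` figure before and after a change gives the decision note a baseline and a rollback check.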

Project / Capstone Connection

This is relevant for analytics, search, or simulation capstones that scan large in-memory structures and need to distinguish bandwidth limits from translation limits.

Required Artifact

Write a huge-page decision note with workload, TLB metric, expected gain, and rollback.


Case Study 3: Copy-On-Write Surprise After Fork

Scenario: A server forks workers after loading a large model. Memory looks shared at first, then each worker mutates global caches and RSS grows sharply.

Source anchor: The Linux fork(2) manual page describes child creation; copy-on-write is the virtual-memory mechanism that makes fork cheap and lets parent and child share pages until one of them writes.

Module concepts: fork, copy-on-write, RSS, shared pages.

Wrong Approach

"Forked workers share all memory forever."

Better Approach

Keep shared pages read-only where possible:

load model before fork
avoid post-fork writes to shared structures
move mutable caches per worker intentionally
measure RSS/PSS
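For the measurement step, `/proc/<pid>/smaps_rollup` on Linux reports `Rss`, `Pss`, and the private/shared breakdown for a process. The sketch below parses captured file contents; the function names and the summary layout are illustrative assumptions.

```python
def parse_smaps_rollup(text: str) -> dict:
    """Parse 'Key:   value kB' lines from smaps_rollup contents into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Data lines look like "Pss:   1234 kB"; the header line has a different shape.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def sharing_summary(fields: dict) -> dict:
    """Split resident memory into private and shared parts."""
    rss = fields.get("Rss", 0)
    private = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    return {
        "rss": rss,
        "pss": fields.get("Pss", 0),
        "private": private,
        "shared": rss - private,
    }
```

Comparing the summary for one worker right after fork and again under load shows whether the shared portion is shrinking as writes accumulate.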

Tradeoff Table

Choice | Gain | Cost
fork after preload | strong initial sharing | sensitive to later writes
mutate shared caches post-fork | simpler code reuse | RSS growth
isolate mutable state per worker | predictable memory | more duplication

Failure Mode

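Workers begin writing to previously shared pages. Each first write breaks copy-on-write sharing for that page and gives the writer a private copy, so memory that was held once is now duplicated per worker. PSS and private-dirty figures expose this more directly than RSS, which counts a shared page in full for every process that maps it.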
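The mechanism can be illustrated with a toy model. This is a deliberately simplified sketch, not how the kernel stores its state: frames are plain integers, each worker has a list acting as its page table, and `CowModel` is a hypothetical class name.

```python
class CowModel:
    """Toy model of copy-on-write after fork."""

    def __init__(self, num_pages: int, num_workers: int):
        # After fork, every worker's page table points at the same frames.
        self.page_tables = [list(range(num_pages)) for _ in range(num_workers)]
        self.refcount = {frame: num_workers for frame in range(num_pages)}
        self.next_frame = num_pages

    def write(self, worker: int, page: int) -> bool:
        """Simulate a write; return True if it forced a private copy."""
        frame = self.page_tables[worker][page]
        if self.refcount[frame] == 1:
            return False  # sole owner: write in place, no copy needed
        # Shared frame: drop one reference and give the writer its own copy.
        self.refcount[frame] -= 1
        new_frame = self.next_frame
        self.next_frame += 1
        self.refcount[new_frame] = 1
        self.page_tables[worker][page] = new_frame
        return True

    def frames_in_use(self) -> int:
        """Physical frames currently referenced by at least one worker."""
        return sum(1 for count in self.refcount.values() if count > 0)
```

With four pages and three workers, the model starts at four frames. Each worker's first write to a shared page adds one frame, until only one reference to the original remains; repeated writes to an already private page add nothing.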

Project / Capstone Connection

Use this when scaling prefork servers, inference workers, or data-processing daemons that load large common state before forking.

Required Artifact

Draw a COW page diagram before and after one worker writes.


Case Study 4: Allocator Fragmentation In A Long-Running Service

Scenario: A service allocates many varied-size objects. Heap usage grows even after request load drops.

Source anchor: malloc behavior is implementation-specific; use mallopt(3) and allocator docs to ground the investigation.

Module concepts: allocator, fragmentation, arenas, RSS, heap profiling.

Wrong Approach

"If objects are freed, memory returns to the OS immediately."

Better Approach

Profile allocation shape:

size classes:
allocation lifetime:
thread arenas:
fragmentation:
RSS vs live heap:
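Several of the profile fields above can be filled from heap-profiler records. The sketch below assumes the profiler yields `(size_bytes, is_live)` pairs and rounds requests up to power-of-two size classes starting at 16 bytes. Both are simplifying assumptions; real allocators use their own class tables.

```python
from collections import Counter


def size_class(size: int) -> int:
    """Round a request up to a power-of-two class (a simplifying assumption)."""
    cls = 16
    while cls < size:
        cls *= 2
    return cls


def profile(allocs, rss_bytes: int) -> dict:
    """Summarize (size_bytes, is_live) records against the process RSS."""
    live_by_class = Counter()
    freed_by_class = Counter()
    live_bytes = 0
    for size, is_live in allocs:
        if is_live:
            live_by_class[size_class(size)] += 1
            live_bytes += size
        else:
            freed_by_class[size_class(size)] += 1
    ratio = rss_bytes / live_bytes if live_bytes else None
    return {
        "live_by_class": dict(live_by_class),
        "freed_by_class": dict(freed_by_class),
        "live_bytes": live_bytes,
        "rss_to_live": ratio,
    }
```

An `rss_to_live` ratio that keeps climbing after load drops is the signal to look at fragmentation and arena behavior rather than at a leak.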

Tradeoff Table

Choice | Gain | Cost
ignore fragmentation | no immediate effort | persistent RSS growth
tune allocator behavior | lower waste | platform-specific tuning
reshape allocation patterns | durable fix | code changes

Failure Mode

Freed objects leave holes across arenas and size classes, so live heap drops while RSS stays high and reuse becomes inefficient.
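A toy model shows why holes keep RSS high. It assumes 4 KiB pages, objects that never move, and an allocator that can return a page to the OS only when the page holds no live object. These are simplifications for illustration, not a description of any specific allocator.

```python
PAGE = 4096


def resident_pages(live_offsets, obj_size: int) -> int:
    """Pages that must stay resident because they hold at least one live object."""
    pages = set()
    for off in live_offsets:
        pages.add(off // PAGE)                    # page holding the first byte
        pages.add((off + obj_size - 1) // PAGE)   # page holding the last byte
    return len(pages)


OBJ = 64
all_offsets = [i * OBJ for i in range(1024)]   # 1024 objects packed into 16 pages

scattered = all_offsets[::64]   # 16 survivors, one per page
compact = all_offsets[:16]      # 16 survivors, all in the first page
```

Both survivor sets hold the same 1 KiB of live data, but the scattered set pins all 16 pages while the compact set pins one. That gap is the difference between live heap and RSS described above.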

Project / Capstone Connection

This fits long-running APIs, brokers, or game servers in capstones where memory growth appears long after the triggering request burst.

Required Artifact

Write an allocator investigation note with allocation profile and mitigation.


Case Study 5: Thrashing From Oversubscribed Memory

Scenario: Five services each fit in memory individually but thrash when colocated: CPU utilization drops, disk I/O rises, and latency explodes.

Source anchor: Linux memory-management and cgroup docs explain memory pressure and limits. See Linux cgroup v2 memory controller.

Module concepts: working set, thrashing, swapping, cgroup memory, OOM.

Wrong Approach

"CPU is low, so the service is not busy."

Better Approach

Treat memory as the bottleneck:

working set total:
exceeds RAM

symptoms:
major faults, swap I/O, reclaim, OOM kills

fix:
reduce colocated workload or set memory limits
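The "working set total exceeds RAM" check is simple arithmetic once each service's working set has been measured. The sketch below uses a hypothetical `placement_check` helper and a reserve for the kernel and page cache; the reserve size is an assumption to tune per node.

```python
def placement_check(working_sets_gib: dict, ram_gib: float, reserve_gib: float = 1.0) -> dict:
    """Compare the sum of per-service working sets with usable RAM on a node."""
    total = sum(working_sets_gib.values())
    usable = ram_gib - reserve_gib  # leave room for the kernel and page cache
    return {
        "total_gib": total,
        "usable_gib": usable,
        "oversubscribed": total > usable,
        "deficit_gib": max(0.0, total - usable),
    }
```

For example, five services with 3 GiB working sets each pass individually on a 16 GiB node with a 2 GiB reserve, but together they need 15 GiB against 14 GiB usable, which is the colocated case in the scenario.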

Tradeoff Table

Choice | Gain | Cost
colocate all services | high utilization on paper | thrashing risk
set memory limits | isolation | earlier OOM or eviction
spread services across nodes | stable latency | infrastructure cost

Failure Mode

The combined working sets exceed available RAM, so reclaim and swap dominate execution and useful CPU work collapses.
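On kernels with pressure stall information enabled, `/proc/pressure/memory` (and the per-cgroup `memory.pressure` file in cgroup v2) reports how much time tasks spent stalled on memory. The parser below is a sketch over captured file contents. The `is_thrashing` threshold is an arbitrary illustrative value, not a kernel default.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI lines like 'some avg10=1.50 avg60=0.80 avg300=0.20 total=12345'."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        result[kind] = {k: float(v) for k, v in (p.split("=", 1) for p in pairs)}
    return result


def is_thrashing(pressure: dict, full_avg10_threshold: float = 10.0) -> bool:
    """Flag sustained stalls where all non-idle tasks waited on memory at once."""
    return pressure.get("full", {}).get("avg10", 0.0) > full_avg10_threshold
```

A rising `full` average alongside low CPU utilization supports the diagnosis that reclaim and swap, not idle services, explain the drop in useful work.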

Project / Capstone Connection

Apply this when packing multiple student services onto a shared VM or Kubernetes node and deciding whether memory isolation or placement must change.

Required Artifact

Create a memory pressure report with working set, faults, swap/reclaim, cgroup limit, and placement decision.


Source Map

Source | Use it for
Linux Page Tables | virtual-to-physical translation
Transparent Hugepage Support | huge pages and TLB pressure
fork(2) | fork and COW context
mallopt(3) | allocator tuning surface
Linux cgroup v2 | memory control and pressure

Completion Standard

  • At least three artifacts are completed.
  • At least one artifact walks address translation or COW.
  • At least one artifact diagnoses faults or memory pressure.