Allocator Comparison Clinic

Where abstract allocator descriptions become measured differences in RSS, throughput, and tail latency.

Retrieval Prompts

State the primary design differences among ptmalloc (the allocator that glibc malloc is derived from), jemalloc, and tcmalloc.
State what LD_PRELOAD does and why it is useful for allocator testing.
State what MALLOC_ARENA_MAX controls in glibc.
State the difference between virtual size, RSS, and actually-in-use bytes.
Explain why a service's RSS can grow while its working set stays flat.
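To make the last two prompts concrete, here is a minimal sketch, assuming Linux and the field names used in /proc/<pid>/status, that pulls out virtual size (VmSize), current RSS (VmRSS), and peak RSS (VmHWM). The helper names parse_vm_fields and read_self are made up for illustration. Live-object bytes are deliberately absent: the kernel does not know them, only the allocator or the program does.

```python
def parse_vm_fields(status_text):
    """Return {'VmSize': kib, 'VmRSS': kib, 'VmHWM': kib} from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS", "VmHWM"}
    fields = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in wanted:
            # Lines look like "VmRSS:\t    5120 kB"; the value is in KiB.
            fields[name] = int(rest.split()[0])
    return fields


def read_self():
    # Linux only: /proc/self/status describes the calling process.
    with open("/proc/self/status") as f:
        return parse_vm_fields(f.read())
```

If VmRSS climbs between two calls to read_self() while the program's own count of live bytes does not, the gap is memory the allocator is holding (or has fragmented) and has not returned to the kernel.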

Compare and Distinguish

Separate these:

internal fragmentation versus external fragmentation
allocator-level caching versus the kernel page cache
RSS versus "live-object bytes"
free returning memory to the allocator versus to the kernel
thread-local caches versus per-arena locks
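For the first pair, a toy model can help. The sketch below rounds each request up to a size class and adds up the bytes handed out beyond what was asked for, which is internal fragmentation. TOY_CLASSES is a made-up table for illustration only; it is not the size-class table of ptmalloc, jemalloc, or tcmalloc.

```python
import bisect

# Hypothetical size classes, for illustration only.
TOY_CLASSES = [16, 32, 48, 64, 96, 128, 192, 256]


def round_to_class(request, classes=TOY_CLASSES):
    """Smallest class that fits the request (classes must be sorted ascending)."""
    i = bisect.bisect_left(classes, request)
    if i == len(classes):
        raise ValueError("request larger than the biggest toy class")
    return classes[i]


def internal_waste(requests, classes=TOY_CLASSES):
    """Bytes handed out beyond what was asked for, summed over all requests."""
    return sum(round_to_class(r, classes) - r for r in requests)
```

In this model a 33-byte request lands in the 48-byte class and wastes 15 bytes inside its own block. External fragmentation is a different thing: free space that exists between blocks but is split into pieces too small to reuse, which this model does not capture.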

Common Mistake Check

Identify the error:

"I called free, so the RSS should drop immediately."
"glibc malloc is slow because it is old; newer allocators will always be faster."
"Switching allocators is risky because it changes program behavior."
"Valgrind says no leaks, so the growing RSS is not caused by allocations."
"The allocator only matters for multithreaded programs."

Mini Application

Lab A: Single-threaded microbenchmark

Write a program that:

Allocates and frees 10 million objects whose sizes are drawn from a mixed distribution (some small sizes picked uniformly from, e.g., 32, 64, and 128 bytes; some large sizes from a long tail).
Measures total wall-clock time, peak RSS, and RSS at exit.

Run under:

default glibc
jemalloc via LD_PRELOAD=$(jemalloc-config --libdir)/libjemalloc.so.$(jemalloc-config --revision)
tcmalloc via LD_PRELOAD=/usr/lib/libtcmalloc.so (path may vary)

Record: throughput (allocs/sec), peak RSS, RSS at exit.
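One way to collect wall-clock time and peak RSS without touching the benchmark is a small harness. This is a sketch, assuming a Unix system where os.wait4 is available and where ru_maxrss is reported in KiB (as on Linux); run_once is a made-up helper name, and the benchmark command and library paths are whatever applies on your machine.

```python
import os
import time


def run_once(cmd, preload=None, extra_env=None):
    """Run cmd to completion; return wall time, peak RSS (KiB on Linux), exit code."""
    env = dict(os.environ)
    if preload:
        env["LD_PRELOAD"] = preload      # path to the allocator's shared library
    else:
        env.pop("LD_PRELOAD", None)      # baseline: whatever malloc the binary links
    env.update(extra_env or {})          # e.g. {"MALLOC_ARENA_MAX": "1"}
    start = time.perf_counter()
    pid = os.spawnvpe(os.P_NOWAIT, cmd[0], cmd, env)
    _, status, usage = os.wait4(pid, 0)  # resource usage of this child only
    wall = time.perf_counter() - start
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    return {"wall_s": wall, "peak_rss_kib": usage.ru_maxrss, "exit_code": code}
```

Throughput is then the benchmark's allocation count divided by wall_s, keeping in mind that wall_s includes process startup. RSS at exit is not visible from outside once the process is gone, so the benchmark has to print it itself just before returning.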

Lab B: Multithreaded contention

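Extend the program above to N threads, each running the allocate/free loop independently. Measure throughput, and use perf stat -e task-clock,context-switches,cpu-migrations as a rough proxy for stalls; by default perf stat reports these counts aggregated over all of the program's threads, not per thread.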

Under glibc, set MALLOC_ARENA_MAX=1 and then MALLOC_ARENA_MAX=N (where N is the thread count) to see how the number of arenas affects contention.

Compare glibc vs jemalloc vs tcmalloc with 1, 4, 16, and 64 threads. Plot throughput vs threads.
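Alongside the plot, it can help to reduce each curve to speedup and efficiency relative to the single-thread run. The sketch below assumes you have already measured throughput per thread count; scaling_table is a made-up helper name and the input is your own data.

```python
def scaling_table(throughput_by_threads):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread run."""
    base = throughput_by_threads[1]
    table = {}
    for n in sorted(throughput_by_threads):
        speedup = throughput_by_threads[n] / base
        table[n] = (speedup, speedup / n)  # efficiency 1.0 means perfect scaling
    return table
```

Efficiency that falls well below 1.0 as threads are added suggests the threads are serializing somewhere. In this lab the allocator's locks are the prime suspect, but check that the machine has at least as many cores as threads before drawing that conclusion.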

Lab C: RSS shape under a long-running workload

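Simulate a service: a main loop that allocates 100,000 16-KiB objects (about 1.5 GiB per round), frees 90% of them, then repeats for 1,000 rounds. Record RSS after every 10 rounds. Decide up front what happens to the surviving 10%: if they are never freed, live bytes grow by about 156 MiB per round and most machines will run out of memory long before round 1,000, so free each round's survivors after a bounded number of rounds (or scale the counts down) to keep live bytes flat.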

Repeat under glibc and jemalloc (and optionally tcmalloc).

You will likely see jemalloc return freed memory to the kernel more aggressively, while glibc may hold RSS near a much higher watermark; the exact shape depends on allocator version and tuning.
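A live-bytes baseline makes the RSS curve easier to read, because the interesting quantity is the gap between the two. The sketch below is plain arithmetic on Lab C's parameters. The survivor_rounds parameter is an assumption added here: it models how long the surviving 10% stay allocated, and live_bytes_per_round is a made-up helper name.

```python
KIB = 1024


def live_bytes_per_round(rounds, objects=100_000, size=16 * KIB,
                         keep_fraction=0.10, survivor_rounds=None):
    """Live bytes at the end of each round, after that round's frees.

    survivor_rounds=None: survivors are never freed.
    survivor_rounds=k: survivors stay allocated for k rounds, counting the
    round that created them, and are then freed.
    """
    kept = round(objects * keep_fraction) * size  # bytes surviving one round
    history = []
    for r in range(1, rounds + 1):
        generations = r if survivor_rounds is None else min(r, survivor_rounds)
        history.append(generations * kept)
    return history
```

With the default parameters each round leaves 10,000 objects of 16 KiB behind, which is 163,840,000 bytes. If survivors are never freed the baseline itself climbs without bound, so a rising RSS says nothing about the allocator; with a bounded survivor lifetime the baseline flattens and any further RSS growth is retention or fragmentation.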

Lab D: Fragmentation stress

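Design an allocation pattern that maximizes external fragmentation under glibc (e.g., allocate many 4000-byte objects, free every other one, then allocate 8000-byte objects). Each freed hole sits between two live objects, so it cannot merge with a neighbor and is too small for the larger request. Measure RSS under glibc, jemalloc, and tcmalloc.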
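The pattern can be checked on paper before running it. The sketch below is a toy model of one contiguous heap with a first-fit search; it ignores chunk headers, alignment, and size classes, and it is not how glibc, jemalloc, or tcmalloc actually place objects. All function names are made up for illustration.

```python
def first_fit(holes, size):
    """Return the start of the first hole that fits, or None. holes: [(start, length)]."""
    for start, length in holes:
        if length >= size:
            return start
    return None


def every_other_freed(count, block):
    """Lay out `count` blocks back to back, then free the even-indexed ones.

    Odd-indexed blocks stay live, so no two holes are adjacent and
    nothing can be coalesced.
    """
    holes = [(i * block, block) for i in range(0, count, 2)]
    heap_end = count * block
    return holes, heap_end


def fragmentation_report(count=1000, block=4000, request=8000):
    holes, heap_end = every_other_freed(count, block)
    return {
        "heap_bytes": heap_end,
        "free_bytes": sum(length for _, length in holes),
        "largest_hole": max(length for _, length in holes),
        "request_fits": first_fit(holes, request) is not None,
    }
```

In the model half of the heap is free, yet the largest hole equals the block size, so no 8000-byte request fits and the heap has to grow. That gap between free bytes and usable free bytes is what the lab's RSS measurement should expose; how closely each real allocator follows the model is the thing to find out.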

Interpretation

After each lab, write a one-paragraph conclusion naming:

which allocator won on throughput
which allocator won on RSS
any surprising result and its likely cause (arena contention, size-class rounding, release policy)

Evidence Check

This clinic is complete only if you can:

produce a single table comparing all three allocators on one workload: peak RSS, RSS at exit, throughput, and 99th-percentile allocation latency
explain, in writing, at least one case where switching from glibc to jemalloc significantly changed RSS on your machine and why
defend a choice of allocator for a hypothetical long-running memory-sensitive service
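For the table, a sketch of the two mechanical parts: a nearest-rank percentile for the latency samples and a plain-text renderer. The column names in COLUMNS and both function names are assumptions chosen for illustration; the rows come from your own measurements.

```python
import math

COLUMNS = ["allocator", "peak_rss_kib", "rss_exit_kib", "allocs_per_s", "p99_ns"]


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def format_table(rows):
    """Render rows (dicts keyed by COLUMNS) as aligned plain text."""
    cells = [COLUMNS] + [[str(row[c]) for c in COLUMNS] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(COLUMNS))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in cells
    )
```

Nearest-rank always returns a value that was actually observed, which suits tail latency better than an interpolated figure. Timing individual malloc calls perturbs the workload, so collect latency samples in a separate run from the one used for throughput.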

Read This Only If Stuck