Userland Allocators: dlmalloc, ptmalloc, jemalloc, tcmalloc

What This Concept Is

A userland allocator sits between the program and the kernel. It asks the kernel for memory in large chunks (via brk/sbrk or mmap) and carves those chunks into the blocks the program requests through malloc, calloc, and realloc, taking them back on free.
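A minimal sketch of that division of labor, as a toy model rather than any real allocator's code. The class name BumpAllocator and the constants CHUNK_SIZE and ALIGNMENT are hypothetical, and the toy never frees anything; it only shows one kernel request serving many small allocations.

```python
import mmap

CHUNK_SIZE = 1 << 20  # hypothetical: ask the kernel for 1 MiB at a time
ALIGNMENT = 16        # hypothetical: round every request up to 16 bytes


class BumpAllocator:
    """Toy allocator: carves small blocks out of large anonymous mappings."""

    def __init__(self):
        self.chunks = []            # mappings obtained from the kernel
        self.offset = CHUNK_SIZE    # forces a kernel request on first use

    def malloc(self, size):
        # Round the request up to the alignment boundary.
        size = (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1)
        if size > CHUNK_SIZE:
            raise ValueError("toy allocator does not handle huge requests")
        if self.offset + size > CHUNK_SIZE:
            # Current chunk exhausted: go back to the kernel for another one.
            self.chunks.append(mmap.mmap(-1, CHUNK_SIZE))
            self.offset = 0
        start = self.offset
        self.offset += size
        # (chunk index, byte offset) stands in for a pointer.
        return len(self.chunks) - 1, start
```

With these constants, 65,536 sixteen-byte allocations are served from a single mmap call. Real allocators add free lists, size classes, and locking on top of this basic carving step.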

Four widely used allocators:

  • dlmalloc (Doug Lea, early 1990s). The ancestor. Single heap, boundary tags on every chunk, segregated free lists by size class. Simple, sequential, single-threaded origin.
  • ptmalloc (Wolfram Gloger). The basis of glibc's default malloc. Extends dlmalloc with multiple arenas, each with its own lock, so that threads contend less; arenas are created dynamically, and large requests above an mmap threshold bypass the arenas.
  • jemalloc (Jason Evans, FreeBSD and Facebook). Multiple arenas with threads spread across them (round-robin in the original design), strict size classes, separate handling of small, large, and huge allocations, detailed statistics, low fragmentation. A good fit for long-running services.
  • tcmalloc (Google). Per-thread caches at the front, central free lists per size class behind, page heap at the back. Very fast small-allocation path, excellent multithreaded scaling.

All four group requests by size class. The three multithreaded designs add the same second idea: serve requests from a per-thread or per-arena pool, and fall back to a central or global layer when the local pool runs dry.

Why It Matters Here

Allocator choice is a measurable performance decision:

  • Multithreaded servers doing many small allocations can see throughput improve by 20-50% after switching from glibc malloc to jemalloc or tcmalloc, with no change beyond linking or preloading a different library.
  • Services whose memory use grows and shrinks in spikes (short-lived workers inside a long-lived process) often show much lower RSS on jemalloc than on ptmalloc, because jemalloc returns pages to the kernel more aggressively.
  • A latency-sensitive service may be ruined by ptmalloc arena contention under load; moving to jemalloc or tcmalloc removes the contention without any code change.

Understanding allocator internals also explains confusing symptoms: "memory leak" that turns out to be fragmentation in an untrimmed heap; CPU time in malloc under lock contention; long pauses from malloc_trim; MALLOC_ARENA_MAX knobs that change memory use dramatically.

Concrete Example

Size classes in jemalloc. Requests are rounded up to a fixed set of sizes: 8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, ... (small), then larger multiples up to "huge" (page-aligned via mmap). Each size class gets its own runs and free lists; within a run, allocation and free are O(1). Internal fragmentation is bounded by the gap between your request and the next size class.
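The rounding rule can be sketched in a few lines. This is an illustration built only from the small classes listed above, not jemalloc's actual lookup code; the names SMALL_CLASSES, size_class, and internal_waste are hypothetical, and rejecting requests below 1 byte is a simplification of this sketch.

```python
import bisect

# Small size classes copied from the list above; the real table continues past 256.
SMALL_CLASSES = [8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256]


def size_class(request):
    """Return the smallest listed class that can hold `request` bytes."""
    if request < 1 or request > SMALL_CLASSES[-1]:
        raise ValueError("outside the small classes modelled here")
    return SMALL_CLASSES[bisect.bisect_left(SMALL_CLASSES, request)]


def internal_waste(request):
    """Bytes lost to rounding for a single allocation."""
    return size_class(request) - request
```

For example, size_class(129) is 160, so internal_waste(129) is 31 bytes. The worst case for any request is always one byte past the previous class, which is why the spacing of the classes bounds internal fragmentation.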

tcmalloc thread cache. Each thread keeps free lists for common size classes, holding typically up to 64 KiB of free objects. Allocations and frees hit this cache first and take no lock. When a list runs empty, the thread refills it from the central free list; when the cache grows too large, the thread returns objects to the central free list. The small-allocation fast path is a handful of machine instructions.
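The refill-and-return behavior can be modelled as follows. This is a toy simulation, not tcmalloc's implementation: the class names, the BATCH and MAX_CACHED constants, and the integer object ids are all hypothetical. The lock_acquisitions counter stands in for the lock a real central list would take.

```python
from collections import defaultdict

BATCH = 4        # hypothetical: objects moved per refill
MAX_CACHED = 8   # hypothetical: per-class limit before objects are handed back


class CentralFreeList:
    """Shared layer; a real allocator guards this with a lock."""

    def __init__(self):
        self.free = defaultdict(list)
        self.next_id = 0
        self.lock_acquisitions = 0

    def fetch(self, size_class, n):
        self.lock_acquisitions += 1
        out = []
        for _ in range(n):
            if self.free[size_class]:
                out.append(self.free[size_class].pop())
            else:
                out.append(self.next_id)  # stand-in for carving from the page heap
                self.next_id += 1
        return out

    def release(self, size_class, objs):
        self.lock_acquisitions += 1
        self.free[size_class].extend(objs)


class ThreadCache:
    """Per-thread layer; touched by one thread only, so it needs no lock."""

    def __init__(self, central):
        self.central = central
        self.lists = defaultdict(list)

    def alloc(self, size_class):
        lst = self.lists[size_class]
        if not lst:  # slow path: the local list is empty
            lst.extend(self.central.fetch(size_class, BATCH))
        return lst.pop()  # fast path: no shared state touched

    def free(self, size_class, obj):
        lst = self.lists[size_class]
        lst.append(obj)
        if len(lst) > MAX_CACHED:  # slow path: hand half back
            half = len(lst) // 2
            self.central.release(size_class, lst[:half])
            del lst[:half]
```

In this model a thread that repeatedly allocates and frees one object never reaches the central list after the first refill, which is the property that makes the fast path cheap.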

ptmalloc arena model. When a new thread first allocates, a new arena may be created for it. Each arena has its own lock. A thread keeps using the arena it was assigned; if that arena is locked, it may try another. On modern 64-bit glibc, the arena limit (overridable with MALLOC_ARENA_MAX) defaults to 8 * num_cpus, which is a lot and contributes to RSS bloat in long-lived multithreaded services.
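To get a feel for the numbers, the default limit can be computed directly. This sketch models the 8-per-core default mentioned above and assumes the 2-per-core figure glibc uses on 32-bit builds; the function names are hypothetical, and the handling of an explicit MALLOC_ARENA_MAX (accept any positive integer, ignore anything else) is a simplification, not glibc's parsing code.

```python
import os


def default_arena_cap(cores, word_bits):
    """Arena limit derived from the core count when MALLOC_ARENA_MAX is unset:
    8 per core on 64-bit builds, 2 per core on 32-bit builds."""
    per_core = 8 if word_bits == 64 else 2
    return per_core * cores


def effective_arena_cap(cores, word_bits, env=None):
    """Apply an explicit MALLOC_ARENA_MAX if one is set (simplified model)."""
    env = os.environ if env is None else env
    override = env.get("MALLOC_ARENA_MAX")
    if override is not None and override.isdigit() and int(override) > 0:
        return int(override)
    return default_arena_cap(cores, word_bits)
```

On a 16-core 64-bit machine, default_arena_cap(16, 64) gives 128 arenas, each of which can hold its own partly used heap. Passing env={"MALLOC_ARENA_MAX": "2"} models capping that at 2.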

Symptom example. A 16-core Python server under uwsgi shows an RSS of 24 GiB against a working set of about 4 GiB, and perf attributes an unexplained 40% of CPU time to libc. Running it as LD_PRELOAD=/usr/lib/libjemalloc.so.2 ./server (the library path varies by distribution) drops RSS to about 6 GiB and removes most of the libc time.

Common Confusion / Misconception

"An allocator is just a data-structure library." It is a systems component with its own memory-management policy, threading model, kernel interaction, and observability. Treating it as a black box is how services end up with mysterious memory bloat.

"free returns memory to the OS." Usually not. It returns memory to the allocator's free lists. Whether the allocator returns those pages to the kernel is a separate decision governed by trim/retain policies. jemalloc and tcmalloc are more aggressive about this than ptmalloc.

"All modern allocators are roughly the same." They are not. They differ significantly in fragmentation, scaling, memory-release behavior, and sensitivity to allocation pattern. Changing the allocator on a production service can move RSS by 30% or more without any code change.

How To Use It

When profiling a memory-heavy service:

  1. Measure malloc/free time with perf or the allocator's own tooling (jemalloc has malloc_stats_print, tcmalloc has MallocExtension::GetStats).
  2. Check per-thread contention. Many threads blocked in __lll_lock_wait with malloc or free further up the stack is an arena-contention signature.
  3. Check RSS vs. working set. If RSS is far above working set and grows monotonically without leaks, suspect fragmentation and allocator policy.
  4. Consider switching. LD_PRELOAD is a cheap way to test jemalloc or tcmalloc without recompiling.
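Step 3 can be partly automated on Linux. The sketch below reads the VmRSS field from the text of /proc/<pid>/status and compares it with a working-set estimate that you supply from your own measurements. The function names and the 2.0 threshold are hypothetical choices for illustration, not standard values.

```python
def rss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:   123456 kB"
    raise ValueError("no VmRSS line found")


def bloat_ratio(rss, working_set):
    """RSS divided by the working-set estimate (both in the same unit)."""
    if working_set <= 0:
        raise ValueError("working set must be positive")
    return rss / working_set


def looks_bloated(rss, working_set, threshold=2.0):
    # threshold is a hypothetical cut-off chosen for this sketch
    return bloat_ratio(rss, working_set) >= threshold
```

A typical call is rss_kib(open("/proc/self/status").read()). For the symptom example above, bloat_ratio(24, 4) is 6.0, which points at allocator policy or fragmentation rather than live data, provided leaks have been ruled out.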

Check Yourself

  1. Why is a single global lock on malloc a scalability killer on modern multi-core hardware?
  2. What is a "size class," and why does it bound internal fragmentation?
  3. What does MALLOC_ARENA_MAX control in glibc, and why might setting it lower reduce RSS for some services?
  4. Why does a per-thread cache in tcmalloc make the fast path nearly lock-free?
  5. Under what workload would ptmalloc and jemalloc perform roughly the same?

Mini Drill or Application

  1. Write a microbenchmark: N threads, each doing malloc(64); free(...) in a loop. Measure throughput under glibc, jemalloc, and tcmalloc via LD_PRELOAD. (You will do this fully in Practice 3.)
  2. A service's RSS grows 50% over a week while working set stays flat. Is this a leak, fragmentation, or arena accumulation? How would you tell?
  3. What does MALLOC_CONF="stats_print:true" do for a jemalloc-linked process?
  4. Explain why a single free-list allocator (first-fit or best-fit, without size classes) can suffer from external fragmentation that a size-class allocator largely avoids.
  5. Describe an allocation pattern that defeats tcmalloc's thread-cache optimization (hint: cross-thread free).

Read This Only If Stuck