
Namespaces and Cgroups: The Two Kernel Features Behind Containers

What This Concept Is

A container is not a small virtual machine. It is an ordinary Linux process that has been given a partial, per-process view of the kernel's global resources.

Two kernel features do nearly all the work:

  • Namespaces give a process its own isolated view of a global resource (processes, filesystems, networks, etc.). The kernel keeps a separate table per namespace and decides which one to consult based on the calling process.
  • Cgroups (control groups, v2 on modern systems) bound what that process can consume (CPU shares, memory caps, I/O bandwidth, pid counts) and account for it.

There are seven commonly used namespace types, each isolating one kind of global state:

Namespace   Isolates
mnt         mount points (filesystem layout)
pid         process IDs (the container sees PID 1)
net         network devices, routing tables, sockets, /proc/net
uts         hostname and domainname
ipc         System V IPC and POSIX message queues
user        UID/GID mappings and capabilities
cgroup      the view of the process's own cgroup tree

(time namespaces also exist for virtualizing CLOCK_MONOTONIC and CLOCK_BOOTTIME, but are rarely used by container runtimes.)
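
To see these namespaces on a live system, the kernel exposes one symlink per namespace type under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host with /proc mounted; show_namespaces is a helper name chosen here for illustration, not a standard tool.

```shell
# List the namespaces a process belongs to.
# Each entry under /proc/<pid>/ns is a symlink whose target looks like
# "pid:[4026531836]": the namespace type plus an inode number that
# identifies one specific namespace instance.
show_namespaces() {
    target_pid="${1:-$$}"          # default: the current shell
    for link in /proc/"$target_pid"/ns/*; do
        printf '%-18s %s\n' "$(basename "$link")" "$(readlink "$link")"
    done
}

show_namespaces
```

Two processes whose links show the same inode number for a type share that namespace. Running the function inside and outside a container should show different numbers for every type the container unshared.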

Why It Matters Here

Every higher-level idea in this module -- pods, images, the runtime, Kubernetes' securityContext -- is a wrapper over these two features. If you cannot explain what namespaces and cgroups do, you cannot explain what Kubernetes actually enforces.

Common downstream questions you cannot answer without this:

  • "Why can a container see the host's time but not its hostname?"
  • "Why does ps -ef inside a container show only a few processes?"
  • "Why does memory.max (memory.limit_in_bytes in cgroups v1) matter, and why does exceeding it produce an OOMKill?"
  • "Why does setting runAsUser: 1000 sometimes still not prevent access to a host file?"

Concrete Example

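You can build a minimal container by hand with unshare. Run it as an ordinary user, without sudo, so that the user namespace maps your own UID to root (this assumes the host permits unprivileged user namespaces):

unshare --user --map-root-user --fork --pid --mount-proc \
    --uts --ipc --net --mount /bin/bash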

Inside the resulting shell:

  • ps -ef shows only a handful of processes, because the pid namespace was unshared and /proc was remounted.
  • hostname new-host changes only this shell's hostname (uts namespace).
  • ip link shows only lo and whatever you moved in (net namespace).
  • id shows uid=0(root), but outside, the process is still the real unprivileged user (user namespace).
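
The uid=0(root) result in the last bullet comes from the user namespace's UID map, which is readable at /proc/self/uid_map as lines of the form "<first inside ID> <first outside ID> <count>". As a rough sketch, the translation from an inside UID to the UID the parent namespace sees can be written as follows. The helper name map_uid and the sample map "0 1000 1" (which assumes the invoking user has UID 1000) are illustrative assumptions.

```shell
# Translate a UID seen inside a user namespace to the UID the parent
# namespace sees, using uid_map lines read from stdin.
# Each line has three fields: <first inside ID> <first outside ID> <count>.
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    '
}

# Assumed sample: the map written by --map-root-user for a user with UID 1000.
printf '0 1000 1\n' | map_uid 0     # prints 1000
printf '0 1000 1\n' | map_uid 33    # prints unmapped
```

On a live system the same function can be fed with map_uid 0 < /proc/<pid>/uid_map. A UID with no mapping shows up inside the namespace as the overflow user (usually nobody).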

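Cgroups v2 are visible at /sys/fs/cgroup. To cap memory on the container, work from a root shell on the host: create the group, set the limit, then move the container's shell into it, where <pid> is that shell's PID as seen from the host. This assumes the memory controller is enabled in the root cgroup's cgroup.subtree_control:

mkdir /sys/fs/cgroup/mygroup
echo 100M > /sys/fs/cgroup/mygroup/memory.max
echo <pid> > /sys/fs/cgroup/mygroup/cgroup.procs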

Now this process is both isolated (namespaces) and bounded (cgroups). That pair is a container.
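
To confirm that the process really landed in mygroup, you can read /proc/<pid>/cgroup. On a cgroups v2 system the unified hierarchy appears as a single line of the form "0::<path>". The sketch below extracts that path; v2_cgroup_path is a hypothetical helper name.

```shell
# Extract the cgroup v2 path from the contents of /proc/<pid>/cgroup.
# The unified (v2) hierarchy is the entry with hierarchy ID 0 and an
# empty controller list, i.e. a line of the form "0::<path>".
v2_cgroup_path() {
    sed -n 's/^0:://p'
}

# On a live system: v2_cgroup_path < /proc/self/cgroup
printf '0::/mygroup\n' | v2_cgroup_path    # prints /mygroup
```

Appending the printed path to /sys/fs/cgroup gives the directory whose memory.max applies to the process. Reading memory.max back after writing 100M shows 104857600, because the suffix is interpreted in powers of 1024.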

Common Confusion / Misconception

"A container is a lightweight VM."

It is not. A VM has its own kernel; a container shares the host kernel and only gets isolated views of the kernel's tables. A kernel panic triggered from inside a container takes down the host. A kernel CVE that bypasses namespace checks is a container escape. This is why user namespaces and seccomp profiles matter: they narrow the attack surface between the container and the shared kernel.

A second confusion: "cgroups are for security." They are not a security boundary. They are a resource accounting and limiting mechanism. Isolation is namespaces plus capabilities plus seccomp; cgroups prevent a noisy neighbor, not a malicious one.

How To Use It

When reading a Kubernetes security or isolation question, translate into the primitives:

  1. What is being isolated? Map the requirement to a namespace type.
  2. What is being bounded? Map the requirement to a cgroup controller (cpu, memory, pids, io).
  3. Who can enter the namespace? Map to capabilities (CAP_SYS_ADMIN), userns mapping, and hostPID/hostNetwork/hostIPC flags in the Pod spec.

How This Maps to Pod Spec Fields

Several Pod spec fields exist precisely to control which namespace a pod does or does not unshare. Reading them without this mental model is guesswork:

Pod spec field                   What it does in namespace terms
hostPID: true                    Do not unshare the pid namespace -- see the host's process tree
hostNetwork: true                Do not unshare net -- container shares the host's network devices and routes
hostIPC: true                    Do not unshare ipc -- share System V / POSIX IPC with host
spec.hostname                    Set the hostname in the pod's uts namespace
spec.securityContext.runAsUser   Set the UID the container process runs as; unless the pod has its own user namespace (hostUsers: false), this is the same UID on the host
resources.limits.cpu / memory    Program the cpu and memory cgroup controllers (cpu.max, memory.max)
spec.securityContext.fsGroup     Supplemental GID for the pod's processes, also applied to supported mounted volumes

When a security reviewer asks "what is this pod allowed to see?", these fields, together with privileged: true and any added capabilities, give you most of the answer.
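
As a small memory aid, the mapping from the table can be written as a lookup. This is only a sketch of the table above; primitive_for is a hypothetical helper, and the short field names it accepts are a simplification chosen for this example.

```shell
# Map a Pod spec field to the kernel primitive it controls, following the
# table above. Unknown fields are reported rather than guessed.
primitive_for() {
    case "$1" in
        hostPID)       echo "pid namespace (shared with host)" ;;
        hostNetwork)   echo "net namespace (shared with host)" ;;
        hostIPC)       echo "ipc namespace (shared with host)" ;;
        hostname)      echo "uts namespace" ;;
        limits.cpu)    echo "cgroup cpu controller (cpu.max)" ;;
        limits.memory) echo "cgroup memory controller (memory.max)" ;;
        *)             echo "not a namespace or cgroup field" ;;
    esac
}

for field in hostPID hostNetwork limits.memory; do
    printf '%s -> %s\n' "$field" "$(primitive_for "$field")"
done
```

Against a real cluster, the field values can be read with a query such as kubectl get pod <name> -o jsonpath='{.spec.hostNetwork}', which prints true when the flag is set and nothing when it is absent.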

Check Yourself

  1. Name all seven common namespace types and what each isolates.
  2. Why are cgroups not a security feature?
  3. If a pod sets hostNetwork: true, which namespace is it not unsharing, and what are the consequences?
  4. Map each of hostPID, hostNetwork, hostIPC to the namespace it suppresses, and give one reason you might legitimately want each.
  5. Why does memory.max being exceeded produce an OOMKill rather than swap or graceful degradation?

Mini Drill or Application

Use unshare to build a container by hand. Enter it, run ps -ef, hostname, ip a, and id. Then exit and run the same commands on the host. Write one paragraph explaining which namespace made each output different, and what you would have to change to punch a hole through one of them.

Then list namespaces via lsns on the host and enter one by PID:

lsns
sudo nsenter --target <pid> --pid --mount --net --uts --ipc /bin/bash

Explain how nsenter is the reverse of unshare, and why an attacker who obtains CAP_SYS_ADMIN on the host can nsenter into any container. This is why pods running privileged: true are effectively inside the host's security boundary.
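
nsenter works by opening the same /proc/<pid>/ns links that lsns reads and joining them with setns(2). One way to check whether an nsenter actually put you where you expected is to compare those links between two processes. The helper below is a sketch with a made-up name, same_ns, and assumes a Linux host with /proc mounted.

```shell
# Succeed (exit 0) when two processes are in the same namespace of a type.
# Usage: same_ns <pid-a> <pid-b> <type>    e.g. same_ns 1 $$ net
# Returns 2 when either link cannot be read (no such process or type,
# or insufficient privilege to inspect the other process).
same_ns() {
    a="$(readlink "/proc/$1/ns/$3")" || return 2
    b="$(readlink "/proc/$2/ns/$3")" || return 2
    [ "$a" = "$b" ]
}

if same_ns "$$" "$$" pid; then
    echo "same pid namespace"
fi
```

Reading the links of a process owned by another user generally requires root, which is the same privilege boundary the nsenter question above is about.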

Cgroups v2: One Hierarchy, Many Controllers

Cgroups v2 replaced the v1 "one hierarchy per controller" design with a single unified hierarchy. Each directory in /sys/fs/cgroup/ is a cgroup; each cgroup enables zero or more controllers (cpu, memory, io, pids) via its parent's cgroup.subtree_control.

Key v2 files a runtime touches per container:

  • cpu.max -- "<quota> <period>" in microseconds; the CFS ceiling
  • cpu.weight -- relative share (1-10000, default 100) when CPU time is contended
  • memory.max -- hard ceiling; if usage reaches it and cannot be reclaimed, the OOM killer runs inside the cgroup
  • memory.high -- soft throttling threshold (reclaim before OOM)
  • pids.max -- cap on process count (prevents fork bombs)
  • io.max -- per-device bandwidth and IOPS limits

Kubernetes resource limits land here; requests affect cpu.weight and scheduling but are not a kernel ceiling.
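
To make the cpu.max format concrete, here is a sketch of how a CPU limit expressed in millicores turns into a "<quota> <period>" line. It assumes a period of 100000 microseconds (100 ms), a common default; cpu_max_for is a hypothetical helper, not a runtime API.

```shell
# Convert a CPU limit in millicores to a cpu.max line.
# quota = millicores / 1000 * period; the period defaults to an assumed
# 100000 microseconds (100 ms) and can be overridden as a second argument.
cpu_max_for() {
    millicores="$1"
    period="${2:-100000}"
    echo "$(( millicores * period / 1000 )) $period"
}

cpu_max_for 500     # prints "50000 100000"  (half a CPU)
cpu_max_for 2000    # prints "200000 100000" (two CPUs)
```

A cgroup with no CPU limit reads "max 100000" instead: the quota field is the literal word max.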

Check-Before-You-Debug Mapping

When a symptom looks confusing, map it to a primitive before you open logs:

Symptom                                          Primitive to check
Pod sees host processes it should not            hostPID: true was set -- pid namespace not unshared
Pod listens on a host port "by magic"            hostNetwork: true -- net namespace not unshared
Memory limit exceeded, container disappears      memory.max reached; OOMKiller targeted cgroup
CPU tanks under load despite requests            cpu.max (throttling) -- note requests is cpu.weight only
File owned by root inside is UID 1000 on host    user namespace mapping (uid_map, gid_map)
fork: Resource temporarily unavailable           pids.max cgroup cap hit (fork bomb protection)

Reading Pod specs through this lens turns "magic" into routine.
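
Three of the cgroup rows in the table leave counters behind that you can read directly: oom_kill in memory.events, nr_throttled in cpu.stat, and max in pids.events. The sketch below reads them for one cgroup directory. The function names are illustrative, and the directory you pass is whichever cgroup you are inspecting.

```shell
# Read one counter from a flat "key value" cgroup file such as
# memory.events, cpu.stat, or pids.events. Prints 0 if the file or key
# is missing (for example, when that controller is not enabled).
cgroup_counter() {
    [ -r "$1" ] || { echo 0; return 0; }
    awk -v key="$2" '$1 == key { print $2; found = 1 }
                     END { if (!found) print 0 }' "$1"
}

# Report the counters behind three symptoms for one cgroup directory.
check_cgroup() {
    dir="$1"
    echo "oom_kill=$(cgroup_counter "$dir/memory.events" oom_kill)"
    echo "nr_throttled=$(cgroup_counter "$dir/cpu.stat" nr_throttled)"
    echo "pids_max_hits=$(cgroup_counter "$dir/pids.events" max)"
}
```

For example, check_cgroup /sys/fs/cgroup/mygroup prints three key=value lines. A non-zero oom_kill points at memory.max, a growing nr_throttled points at cpu.max, and a non-zero pids_max_hits points at pids.max.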

Read This Only If Stuck