Namespaces and Cgroups: The Two Kernel Features Behind Containers
What This Concept Is
A container is not a small virtual machine. It is an ordinary Linux process that has been given a partial, per-process view of the kernel's global resources.
Two kernel features do nearly all the work:
- Namespaces give a process its own isolated view of a global resource (processes, filesystems, networks, etc.). The kernel keeps a separate table per namespace and decides which one to consult based on the calling process.
- Cgroups (control groups, v2 on modern systems) bound what that process can consume (CPU shares, memory caps, I/O bandwidth, pid counts) and account for it.
There are seven commonly used namespace types, each isolating one kind of global state:
| Namespace | Isolates |
|---|---|
| mnt | mount points (filesystem layout) |
| pid | process IDs (the container's first process sees itself as PID 1) |
| net | network devices, routing tables, sockets, /proc/net |
| uts | hostname and domainname |
| ipc | System V IPC and POSIX message queues |
| user | UID/GID mappings and capabilities |
| cgroup | the view of the process's own cgroup tree |
(Time namespaces also exist for virtualizing CLOCK_MONOTONIC and CLOCK_BOOTTIME, but container runtimes rarely use them.)
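The per-namespace bookkeeping described above is visible from userspace: each entry under /proc/<pid>/ns is a symlink whose target names the namespace type and an identifier. The following is a minimal sketch, assuming a Linux host with /proc mounted; the function names list_namespaces and same_namespace are illustrative, not part of any tool.

```shell
# List the namespace identifiers of a process (default: this shell).
# Each entry under /proc/<pid>/ns is a symlink such as "pid:[4026531836]".
list_namespaces() {
  local pid=${1:-$$}
  local ns
  for ns in /proc/"$pid"/ns/*; do
    printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
  done
}

# Succeed if two processes are in the same namespace of the given type.
# Two processes share a namespace exactly when the link targets match.
same_namespace() {
  local type=$1 pid_a=$2 pid_b=$3
  [ "$(readlink "/proc/$pid_a/ns/$type")" = "$(readlink "/proc/$pid_b/ns/$type")" ]
}

list_namespaces
same_namespace uts $$ $$ && echo "same uts namespace"
```

Reading another user's /proc/<pid>/ns entries usually requires root, so comparing a container process against the host shell is easiest from a root shell.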
Why It Matters Here
Every higher-level idea in this module -- pods, images, the runtime, Kubernetes' securityContext -- is a wrapper over these two features. If you cannot explain what namespaces and cgroups do, you cannot explain what Kubernetes actually enforces.
Common downstream questions you cannot answer without this:
- "Why can a container see the host's time but not its hostname?"
- "Why does ps -ef inside a container show only a few processes?"
- "Why does memory.max (memory.limit_in_bytes on cgroups v1) matter, and why does exceeding it produce an OOMKill?"
- "Why does setting runAsUser: 1000 sometimes still not prevent access to a host file?"
Concrete Example
You can build a minimal container by hand with unshare. Because --user is included, sudo is not needed on hosts that allow unprivileged user namespaces:
unshare --fork --pid --mount-proc --uts --ipc --net --mount \
--user --map-root-user /bin/bash
Inside the resulting shell:
- ps -ef shows a handful of processes, because the pid namespace was unshared and /proc was remounted.
- hostname new-host changes only this shell's hostname (uts namespace).
- ip link shows only lo and whatever you moved in (net namespace).
- id shows uid=0(root), but that root exists only inside the user namespace; on the host the process keeps the UID of whoever launched unshare.
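Cgroups v2 are visible at /sys/fs/cgroup. To cap memory on the container, work from the host: create the cgroup, set its limit, then move the container's process into it (<pid> is its PID as seen from the host). This assumes the memory controller is enabled in the root cgroup's cgroup.subtree_control:
sudo mkdir /sys/fs/cgroup/mygroup
echo 100M | sudo tee /sys/fs/cgroup/mygroup/memory.max
echo <pid> | sudo tee /sys/fs/cgroup/mygroup/cgroup.procs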
Now this process is both isolated (namespaces) and bounded (cgroups). That pair is a container.
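To confirm which cgroup a process actually landed in, you can read /proc/<pid>/cgroup and then look up the limit files under that path. This is a rough sketch, assuming cgroups v2 mounted at /sys/fs/cgroup; cgroup_v2_path and show_memory_limit are illustrative helper names.

```shell
# Extract the cgroup v2 path from /proc/<pid>/cgroup content on stdin.
# On a v2 host the file holds a line of the form "0::/some/path".
cgroup_v2_path() {
  sed -n 's/^0:://p'
}

# Print the memory ceiling and current usage for a process's cgroup.
# Assumes cgroups v2 at /sys/fs/cgroup with the memory controller enabled
# for that cgroup (the root cgroup has no memory.max file).
show_memory_limit() {
  local pid=${1:-self}
  local dir
  dir="/sys/fs/cgroup$(cgroup_v2_path < "/proc/$pid/cgroup")"
  echo "cgroup:         $dir"
  echo "memory.max:     $(cat "$dir/memory.max")"
  echo "memory.current: $(cat "$dir/memory.current")"
}

# Parsing alone needs no privileges:
printf '0::/mygroup\n' | cgroup_v2_path    # prints: /mygroup
```

For a process placed in mygroup with a 100M cap, show_memory_limit <pid> should report memory.max as 104857600, because the kernel reads the M suffix as MiB.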
Common Confusion / Misconception
"A container is a lightweight VM."
It is not. A VM has its own kernel; a container shares the host kernel and only gets isolated views of the kernel's tables. A kernel panic inside a container is a host panic. A kernel CVE that bypasses namespace checks is a container escape. This is why user namespaces and seccomp profiles matter: they narrow the attack surface between the container and the shared kernel.
A second confusion: "cgroups are for security." They are not a security boundary. They are a resource accounting and limiting mechanism. Isolation is namespaces plus capabilities plus seccomp; cgroups prevent a noisy neighbor, not a malicious one.
How To Use It
When reading a Kubernetes security or isolation question, translate into the primitives:
- What is being isolated? Map the requirement to a namespace type.
- What is being bounded? Map the requirement to a cgroup controller (cpu, memory, pids, io).
- Who can enter the namespace? Map to capabilities (CAP_SYS_ADMIN), userns mapping, and hostPID / hostNetwork / hostIPC flags in the Pod spec.
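For the capability question, the kernel exposes each process's effective capabilities as a hex bitmask on the CapEff line of /proc/<pid>/status. The sketch below tests one bit of that mask; it assumes a Linux /proc, and has_capability and effective_caps are hypothetical helper names.

```shell
# Succeed if the given capability number is set in a hex mask.
# The mask is the value of the CapEff line in /proc/<pid>/status.
has_capability() {
  local mask=${1:-0} cap_number=$2
  [ $(( (0x$mask >> $cap_number) & 1 )) -eq 1 ]
}

# CAP_SYS_ADMIN is capability number 21 in <linux/capability.h>.
CAP_SYS_ADMIN=21

# Read the effective capability mask of a process (default: the caller).
effective_caps() {
  awk '$1 == "CapEff:" { print $2 }' "/proc/${1:-self}/status"
}

if has_capability "$(effective_caps)" "$CAP_SYS_ADMIN"; then
  echo "this shell holds CAP_SYS_ADMIN"
else
  echo "this shell does not hold CAP_SYS_ADMIN"
fi
```

Run inside the unshare shell and again on the host to see how a user namespace grants a full capability set that applies only to resources the namespace owns.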
How This Maps to Pod Spec Fields
Several Pod spec fields exist precisely to control which namespace a pod does or does not unshare. Reading them without this mental model is guesswork:
| Pod spec field | What it does in namespace terms |
|---|---|
| hostPID: true | Do not unshare the pid namespace -- see the host's process tree |
| hostNetwork: true | Do not unshare net -- container shares the host's network devices and routes |
| hostIPC: true | Do not unshare ipc -- share System V / POSIX IPC with host |
| spec.hostname | Set uts namespace's hostname for the pod |
| spec.securityContext.runAsUser | Set the UID the container process runs as; it maps to a different host UID only when the pod runs in a user namespace (hostUsers: false) |
| resources.limits.cpu / memory | Program the cpu and memory cgroup controllers |
| spec.securityContext.fsGroup | GID applied to mounted volumes in the mnt namespace view |
When a security reviewer asks "what is this pod allowed to see?", you can answer most of it by reading these fields.
Check Yourself
- Name all seven common namespace types and what each isolates.
- Why are cgroups not a security feature?
- If a pod sets hostNetwork: true, which namespace is it not unsharing, and what are the consequences?
- Map each of hostPID, hostNetwork, hostIPC to the namespace it suppresses, and give one reason you might legitimately want each.
- Why does memory.max being exceeded produce an OOMKill rather than swap or graceful degradation?
Mini Drill or Application
Use unshare to build a container by hand. Enter it, run ps -ef, hostname, ip a, and id. Then exit and run the same commands on the host. Write one paragraph explaining which namespace made each output different, and what you would have to change to punch a hole through one of them.
Then list namespaces via lsns on the host and enter one by PID:
lsns
sudo nsenter --target <pid> --pid --mount --net --uts --ipc /bin/bash
Explain how nsenter is the reverse of unshare, and why an attacker who obtains CAP_SYS_ADMIN on the host can nsenter into any container. This is why pods running privileged: true are effectively inside the host's security boundary.
Cgroups v2: One Hierarchy, Many Controllers
Cgroups v2 replaced the v1 "one hierarchy per controller" design with a single unified hierarchy. Each directory in /sys/fs/cgroup/ is a cgroup; each cgroup enables zero or more controllers (cpu, memory, io, pids) via its parent's cgroup.subtree_control.
Key v2 files a runtime touches per container:
- cpu.max -- "<quota> <period>" in microseconds; the CFS bandwidth ceiling
- cpu.weight -- relative share (1-10000) when the CPU is contended
- memory.max -- hard ceiling; if usage cannot be reclaimed below it, the OOM killer runs inside the cgroup
- memory.high -- soft throttling threshold (reclaim before OOM)
- pids.max -- cap on process count (prevents fork bombs)
- io.max -- per-device bandwidth and IOPS limits
Kubernetes resource limits land here; requests affect cpu.weight and scheduling but are not a kernel ceiling.
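The arithmetic behind a CPU limit is simple enough to check by hand. As a sketch, assuming the commonly used 100000-microsecond period (a runtime or kubelet can be configured with another), a limit in millicores becomes a cpu.max line like this; the function names are illustrative.

```shell
# Convert a CPU limit in millicores to a cpu.max line "<quota> <period>".
# Assumes a 100000 us period unless a second argument overrides it.
cpu_limit_to_cpu_max() {
  local millicores=$1
  local period=${2:-100000}
  # quota (us of CPU time per period) = millicores / 1000 * period
  echo "$(( millicores * period / 1000 )) ${period}"
}

# Reverse direction: turn a cpu.max line into a ceiling in CPUs.
# A quota of "max" means no ceiling is set.
cpu_max_to_cpus() {
  local quota=$1 period=$2
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
  fi
}

cpu_limit_to_cpu_max 500       # prints: 50000 100000
cpu_max_to_cpus 50000 100000   # prints: 0.50
cpu_max_to_cpus max 100000     # prints: unlimited
```

A limit of 500m therefore allows 50 ms of CPU time in every 100 ms window; once the quota is spent, the cgroup's tasks are throttled until the next period starts.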
Check-Before-You-Debug Mapping
When a symptom looks confusing, map it to a primitive before you open logs:
| Symptom | Primitive to check |
|---|---|
| Pod sees host processes it should not | hostPID: true was set -- pid namespace not unshared |
| Pod listens on a host port "by magic" | hostNetwork: true -- net namespace not unshared |
| Memory limit exceeded, container disappears | memory.max reached; the OOM killer targeted the cgroup |
| CPU tanks under load despite requests | cpu.max (throttling) -- note that requests set cpu.weight only |
| File owned by root inside shows as UID 1000 on host | user namespace mapping (uid_map, gid_map) |
| fork: Resource temporarily unavailable | pids.max cgroup cap hit (fork bomb protection) |
Reading Pod specs through this lens turns "magic" into routine.
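Several rows of the table can be checked directly from the cgroup's event counters rather than from logs. The sketch below reads three cgroups v2 files (memory.events, cpu.stat, pids.events), which hold "key value" lines; cgroup_counter and diagnose_cgroup are illustrative names, and the cgroup directory you pass in is an assumption about where your runtime placed the container.

```shell
# Print the value for one key from a cgroup v2 "key value" file.
cgroup_counter() {
  local key=$1 file=$2
  awk -v k="$key" '$1 == k { print $2 }' "$file"
}

# Summarize the counters that match the symptoms in the table:
#   oom_kill     - processes in the cgroup killed by the OOM killer
#   nr_throttled - periods in which cpu.max throttled the cgroup
#   max          - times a fork was refused because pids.max was hit
diagnose_cgroup() {
  local cg=$1
  echo "oom_kill:      $(cgroup_counter oom_kill "$cg/memory.events")"
  echo "nr_throttled:  $(cgroup_counter nr_throttled "$cg/cpu.stat")"
  echo "pids.max hits: $(cgroup_counter max "$cg/pids.events")"
}

# Example (path is whatever cgroup holds the container):
#   diagnose_cgroup /sys/fs/cgroup/mygroup
```

A nonzero oom_kill points at memory.max, a growing nr_throttled points at cpu.max, and a nonzero pids.events max points at pids.max.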
Read This Only If Stuck
- Linux Command Line: How a process works and viewing processes -- foundation for the PID namespace view.
- Linux Command Line: Viewing processes dynamically with top -- tools whose output a container reshapes.
- Linux Command Line: Owners, group members, and everybody else -- UID/GID model user namespaces remap.
- Linux man page: namespaces(7) -- authoritative enumeration of the namespace types.
- Linux man page: cgroups(7) -- the v2 unified hierarchy and controller interface files.
- Linux man page: user_namespaces(7) -- UID/GID mapping and capability scoping inside a user namespace.
- Kubernetes: Security Contexts (SecurityContext) -- how Pod fields surface namespace/cgroup knobs.
- Kubernetes: Node cgroup drivers -- why cgroups v2 matters for the kubelet and which driver to pick.
- CNCF: runc -- OCI reference runtime -- the actual code that calls clone() with the namespace flags.