Namespaces and Cgroups: The Two Kernel Features Behind Containers

What This Concept Is

A container is not a small virtual machine. It is an ordinary Linux process that has been given a partial, per-process view of the kernel's global resources.

Two kernel features do nearly all the work:

Namespaces give a process its own isolated view of a global resource (processes, filesystems, networks, etc.). The kernel keeps a separate table per namespace and decides which one to consult based on the calling process.
Cgroups (control groups, v2 on modern systems) bound what that process can consume (CPU shares, memory caps, I/O bandwidth, pid counts) and account for it.

There are seven commonly used namespace types, each isolating one kind of global state:

Namespace	Isolates
`mnt`	mount points (filesystem layout)
`pid`	process IDs (the container's first process sees itself as `PID 1`)
`net`	network devices, routing tables, sockets, `/proc/net`
`uts`	`hostname` and `domainname`
`ipc`	System V IPC and POSIX message queues
`user`	UID/GID mappings and capabilities
`cgroup`	the view of the process's own cgroup tree

(time namespaces also exist for virtualizing CLOCK_MONOTONIC and CLOCK_BOOTTIME, but are rarely used by container runtimes.)
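Each namespace a process belongs to is exposed as a symlink under /proc/<pid>/ns/, and two processes share a namespace exactly when the link targets match. The sketch below shows one way to compare them; `ns_id` and `same_ns` are hypothetical helper names, not standard tools, and reading another user's /proc entries generally needs root.

```shell
#!/bin/sh
# Print the namespace identifier of one type for one PID, e.g. "uts:[4026531838]".
# ns_id is a hypothetical helper name used only for this illustration.
ns_id() {
    # $1 = namespace type (mnt, pid, net, uts, ipc, user, cgroup), $2 = PID
    readlink "/proc/$2/ns/$1"
}

# Succeed if two PIDs are in the same namespace of the given type.
same_ns() {
    # $1 = namespace type, $2 and $3 = PIDs to compare
    a=$(ns_id "$1" "$2") || return 2
    b=$(ns_id "$1" "$3") || return 2
    [ "$a" = "$b" ]
}

# List this shell's own namespace for each of the seven common types.
for t in mnt pid net uts ipc user cgroup; do
    echo "$t -> $(ns_id "$t" "$$" 2>/dev/null || echo unavailable)"
done
```

Running `same_ns pid 1 <pid>` as root on the host is a quick way to tell whether a given process still lives in the host's pid namespace.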

Why It Matters Here

Every higher-level idea in this module -- pods, images, the runtime, Kubernetes' securityContext -- is a wrapper over these two features. If you cannot explain what namespaces and cgroups do, you cannot explain what Kubernetes actually enforces.

Common downstream questions you cannot answer without this:

"Why can a container see the host's time but not its hostname?"
"Why does ps -ef inside a container show only a few processes?"
"Why does memory.max (memory.limit_in_bytes under cgroups v1) matter, and why does exceeding it produce an OOMKill?"
"Why does setting runAsUser: 1000 sometimes still not prevent access to a host file?"

Concrete Example

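You can build a minimal container by hand with unshare. Run it as your normal user, without sudo, so that the user namespace maps your unprivileged UID to root inside (this assumes the kernel permits unprivileged user namespaces):

unshare --fork --pid --mount-proc --uts --ipc --net --mount \
  --user --map-root-user /bin/bash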

Inside the resulting shell:

ps -ef shows a handful of processes, because the pid namespace was unshared and /proc was remounted.
hostname new-host changes only this shell's hostname (uts namespace).
ip link shows only lo and whatever you moved in (net namespace).
id shows uid=0(root), but outside, the process is still the real unprivileged user (user namespace).
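The uid=0 result comes from the mapping the kernel publishes in /proc/<pid>/uid_map, where each line reads "inside outside count". As a rough sketch, the translation from an inside UID to the host UID can be reproduced with a small helper; `uid_outside` is a hypothetical name introduced here for illustration.

```shell
#!/bin/sh
# Translate a UID inside a user namespace to the UID outside it.
# Reads uid_map-formatted lines ("inside outside count") from stdin.
# uid_outside is a hypothetical helper, not part of any standard tool.
uid_outside() {
    # $1 = UID as seen inside the namespace
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Show what the current process's own UID maps to, if a map is readable.
if [ -r /proc/self/uid_map ]; then
    uid_outside "$(id -u)" < /proc/self/uid_map || echo "unmapped"
fi
```

Inside the unshare shell above, the map typically holds a single line such as "0 1000 1", so UID 0 translates to your ordinary host UID and every other UID is unmapped.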

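Cgroups v2 are visible at /sys/fs/cgroup. To cap the container's memory, work from a root shell on the host (not inside the container, where root is mapped to an unprivileged user): create the cgroup, set the limit, then move the container's shell into it, where <pid> is that shell's PID as seen from the host. This assumes the memory controller is enabled in the parent's cgroup.subtree_control, which is typical on systemd hosts:

mkdir /sys/fs/cgroup/mygroup
echo 100M > /sys/fs/cgroup/mygroup/memory.max
echo <pid> > /sys/fs/cgroup/mygroup/cgroup.procs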

Now this process is both isolated (namespaces) and bounded (cgroups). That pair is a container.
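To confirm that a process really landed in the cgroup, you can read its membership back from /proc/<pid>/cgroup, where the cgroups v2 entry is the line beginning with "0::". The following is a minimal sketch; `cgroup_v2_path`, `memory_limit_of`, and the `CGROUP_ROOT` override are assumptions made for this example.

```shell
#!/bin/sh
# Print the cgroup v2 path of a process from a /proc/<pid>/cgroup-style file.
# cgroup_v2_path and memory_limit_of are hypothetical helper names.
cgroup_v2_path() {
    # $1 = file to read; the v2 entry is the line starting with "0::"
    sed -n 's/^0:://p' "$1"
}

# Print the memory.max value of the cgroup that the file points at.
# CGROUP_ROOT is an assumed override so the helper can be pointed elsewhere.
memory_limit_of() {
    # $1 = /proc/<pid>/cgroup-style file
    rel=$(cgroup_v2_path "$1")
    cat "${CGROUP_ROOT:-/sys/fs/cgroup}$rel/memory.max"
}

# Show which cgroup the current shell is in (prints nothing on a v1-only host).
if [ -r /proc/self/cgroup ]; then
    cgroup_v2_path /proc/self/cgroup
fi
```

For the shell moved into mygroup, `memory_limit_of /proc/<pid>/cgroup` should report the limit in bytes (104857600 for 100M). The root cgroup has no memory.max file, so the helper fails for processes that were never moved.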

Common Confusion / Misconception

"A container is a lightweight VM."

It is not. A VM has its own kernel; a container shares the host kernel and only gets isolated views of the kernel's tables. A kernel panic triggered from inside a container is a host panic. A kernel CVE that bypasses namespace checks is a container escape. This is why user namespaces and seccomp profiles matter: they narrow the attack surface between the container and the shared kernel.

A second confusion: "cgroups are for security." They are not a security boundary. They are a resource accounting and limiting mechanism. Isolation is namespaces plus capabilities plus seccomp; cgroups prevent a noisy neighbor, not a malicious one.

How To Use It

When reading a Kubernetes security or isolation question, translate into the primitives:

What is being isolated? Map the requirement to a namespace type.
What is being bounded? Map the requirement to a cgroup controller (cpu, memory, pids, io).
Who can enter the namespace? Map to capabilities (CAP_SYS_ADMIN), userns mapping, and hostPID/hostNetwork/hostIPC flags in the Pod spec.

How This Maps to Pod Spec Fields

Several Pod spec fields control which namespaces a pod does or does not unshare; others set the identity and cgroup limits that apply inside them. Reading them without this mental model is guesswork:

Pod spec field	What it does in namespace terms
`hostPID: true`	Do not unshare the `pid` namespace -- see the host's process tree
`hostNetwork: true`	Do not unshare `net` -- container shares the host's network devices and routes
`hostIPC: true`	Do not unshare `ipc` -- share System V / POSIX IPC with host
`spec.hostname`	Set `uts` namespace's hostname for the pod
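`spec.securityContext.runAsUser`	UID the container process runs as; it maps to a different host UID only if the pod gets its own `user` namespace (`hostUsers: false`)
`resources.limits.cpu / memory`	Program the `cpu` and `memory` cgroup controllers (`cpu.max`, `memory.max`)
`spec.securityContext.fsGroup`	Supplemental GID for the pod's processes, also applied as group owner of supported mounted volumes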

When a security reviewer asks "what is this pod allowed to see?", start with these fields, then check privileged, added capabilities, and volume mounts.
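As a first pass over a manifest, the three host* flags from the table can be spotted with plain text matching. This is only a sketch: `host_namespaces` is a hypothetical helper, it matches lines rather than parsing YAML, and it will miss flags set through templating or unusual formatting.

```shell
#!/bin/sh
# Report which namespaces a Pod manifest asks to share with the host.
# Naive line matching, not a YAML parser; host_namespaces is a hypothetical name.
host_namespaces() {
    # $1 = path to a Pod manifest
    for pair in hostPID:pid hostNetwork:net hostIPC:ipc; do
        field=${pair%%:*}   # Pod spec field name
        ns=${pair##*:}      # namespace type it leaves shared
        if grep -Eq "^[[:space:]]*$field:[[:space:]]*true[[:space:]]*\$" "$1"; then
            echo "$ns shared with host ($field: true)"
        fi
    done
}

# Example (pod.yaml is a placeholder path):
#   host_namespaces pod.yaml
```

An empty result means none of the three flags is set to true in the file, not that the pod is otherwise safe.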

Check Yourself

Name all seven common namespace types and what each isolates.
Why are cgroups not a security feature?
If a pod sets hostNetwork: true, which namespace is it not unsharing, and what are the consequences?
Map each of hostPID, hostNetwork, hostIPC to the namespace it suppresses, and give one reason you might legitimately want each.
Why does memory.max being exceeded produce an OOMKill rather than swap or graceful degradation?

Mini Drill or Application

Use unshare to build a container by hand. Enter it, run ps -ef, hostname, ip a, and id. Then exit and run the same commands on the host. Write one paragraph explaining which namespace made each output different, and what you would have to change to punch a hole through one of them.

Then list namespaces via lsns on the host and enter one by PID:

lsns
sudo nsenter --target <pid> --pid --mount --net --uts --ipc /bin/bash

Explain how nsenter is the reverse of unshare, and why an attacker who obtains CAP_SYS_ADMIN on the host can nsenter into any container. This is why pods running privileged: true are effectively inside the host's security boundary.

Cgroups v2: One Hierarchy, Many Controllers

Cgroups v2 replaced the v1 "one hierarchy per controller" design with a single unified hierarchy. Each directory in /sys/fs/cgroup/ is a cgroup; each cgroup enables zero or more controllers (cpu, memory, io, pids) via its parent's cgroup.subtree_control.

Key v2 files a runtime touches per container:

cpu.max -- "<quota> <period>" in microseconds; the CFS ceiling
cpu.weight -- relative share (1-10000, default 100) that applies only when CPUs are contended
memory.max -- hard ceiling; if usage cannot be reclaimed below it, the OOM killer runs inside the cgroup
memory.high -- soft throttling threshold (reclaim before OOM)
pids.max -- cap on process count (prevents fork bombs)
io.max -- per-device bandwidth and IOPS limits

Kubernetes resource limits land here; requests affect cpu.weight and scheduling but are not a kernel ceiling.
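The arithmetic behind a CPU limit is simple enough to check by hand: the quota is the limit in millicores scaled to the period. The sketch below assumes a period of 100000 microseconds (the usual CFS default); the helper names are hypothetical.

```shell
#!/bin/sh
# Convert a CPU limit in millicores to a cpu.max line "<quota> <period>".
# Assumes a 100000 microsecond (100 ms) period; helper names are hypothetical.
cpu_max_for_millicores() {
    # $1 = limit in millicores, e.g. 500 for "500m"
    period=100000
    echo "$(( $1 * period / 1000 )) $period"
}

# Convert a cpu.max line back to a core count; "max" means no ceiling.
cores_from_cpu_max() {
    # $1 = quota (or "max"), $2 = period
    if [ "$1" = "max" ]; then
        echo unlimited
    else
        awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
    fi
}

cpu_max_for_millicores 500          # prints: 50000 100000
cores_from_cpu_max 150000 100000    # prints: 1.50
```

A limit of 500m therefore lets the cgroup run for 50 ms in every 100 ms window, after which it is throttled until the next period.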

Check-Before-You-Debug Mapping

When a symptom looks confusing, map it to a primitive before you open logs:

Symptom	Primitive to check
Pod sees host processes it should not	`hostPID: true` was set -- `pid` namespace not unshared
Pod listens on a host port "by magic"	`hostNetwork: true` -- `net` namespace not unshared
Memory limit exceeded, container disappears	`memory.max` reached; OOMKiller targeted cgroup
CPU tanks under load despite `requests`	`cpu.max` (throttling) -- note `requests` is `cpu.weight` only
File owned by root inside is owned by an unprivileged UID on host	`user` namespace mapping (`uid_map`, `gid_map`)
`fork: Resource temporarily unavailable`	`pids.max` cgroup cap hit (fork bomb protection)

Reading Pod specs through this lens turns "magic" into routine.
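Several of these symptoms leave a counter behind in the cgroup's own files: nr_throttled in cpu.stat, oom_kill in memory.events, and pids.current against pids.max. A small sketch for pulling them out of one cgroup directory follows; `cg_counter` and `cg_triage` are hypothetical names, and the files exist only when the matching controller is enabled for that cgroup.

```shell
#!/bin/sh
# Print the value of one "key value" counter from a cgroup v2 flat-keyed file.
# cg_counter and cg_triage are hypothetical helper names.
cg_counter() {
    # $1 = file (cpu.stat, memory.events, ...), $2 = key
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { if (!found) exit 1 }' "$1"
}

# Summarize the counters that correspond to the symptoms above.
cg_triage() {
    # $1 = cgroup directory, e.g. /sys/fs/cgroup/mygroup
    echo "throttled periods: $(cg_counter "$1/cpu.stat" nr_throttled)"
    echo "oom kills: $(cg_counter "$1/memory.events" oom_kill)"
    echo "pids: $(cat "$1/pids.current") of $(cat "$1/pids.max")"
}

# Example (directory name taken from the earlier mygroup example):
#   cg_triage /sys/fs/cgroup/mygroup
```

A nonzero throttled count points at cpu.max, a nonzero oom_kill at memory.max, and a pids value at its maximum at pids.max.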

Read This Only If Stuck

Linux Command Line: How a process works and viewing processes -- foundation for the PID namespace view.
Linux Command Line: Viewing processes dynamically with top -- tools whose output a container reshapes.
Linux Command Line: Owners, group members, and everybody else -- UID/GID model user namespaces remap.
Linux man page: namespaces(7) -- authoritative enumeration of the namespace types (the seven above plus time).
Linux man page: cgroups(7) -- the v2 unified hierarchy and controller interface files.
Linux man page: user_namespaces(7) -- UID/GID mapping and capability scoping inside a user namespace.
Kubernetes: Security Contexts (SecurityContext) -- how Pod fields surface namespace/cgroup knobs.
Kubernetes: Node cgroup drivers -- why cgroups v2 matters for the kubelet and which driver to pick.
CNCF: runc -- OCI reference runtime -- the actual code that calls clone() with the namespace flags.
