Build Your Own Docker / Container Runtime

"A container is just a process, with some flags."

Containers feel like a major piece of infrastructure. They are. But the underlying primitives — Linux namespaces, cgroups, chroot, capabilities — are accessible from C, Go, or Python in under 200 lines. Building a tiny container runtime is the single best way to demystify Docker.


1. Overview & motivation

A "container" is a process that has been isolated using a few Linux kernel features:

  • Namespaces — separate views of the system: PID, mount, UTS (hostname), IPC, net, user.
  • cgroups — resource limits: CPU, memory, I/O.
  • chroot (or pivot_root) — restricted view of the filesystem.
  • capabilities — drop privileges.
  • seccomp — restrict system calls.

What you can only learn by building one:

  • Why containers are not virtual machines — they share the host kernel, which is both their power and their security limitation.
  • Why clone() with namespace flags is the primitive Docker is built on.
  • Why overlay filesystems are how Docker images stack layers.
  • Why container security is a constant battle (escape paths exist; cgroup misconfigurations matter).

2. Where this fits in the degree

  • Phase: Systems
  • Semester: 5 (OS and Networking)
  • Modules deepened: Module 1 (processes: clone() is fork() with knobs), Module 3 (concurrency: namespaces give processes separate views of global kernel resources), Module 4 (file systems: overlay FS, chroot).

Cross-phase relevance:

  • Direct background for cloud/DevOps work in Sem 9 (Kubernetes manages containers).
  • Builds on the Shell tutorial (containers wrap a fork/exec).

3. Prerequisites

  • Complete the Shell tutorial first — you need to be comfortable with fork/exec/wait.
  • Linux. The tutorial is Linux-only. Container primitives are Linux kernel features.
  • Root access (or capabilities). Most operations require it.
  • C or Go. Most of the BYO-X catalog uses one of these.

4. Theory & research

Required reading

  • Michael Kerrisk, The Linux Programming Interface — Chapter 28 (creating processes via clone()), Section 28.2.1 (Linux-specific clone() flags).
  • Linux man pages: man 7 namespaces, man 7 capabilities, man 7 cgroups.
  • OCI Runtime Specification: github.com/opencontainers/runtime-spec. The standard interface implemented by the low-level runtimes (such as runc and crun) that Docker, Podman, and containerd rely on.

For depth


5. Curated tutorial list (from BYO-X)


Two excellent starting points; pick by language preference:

  • Liz Rice's video + repo (containers-from-scratch) — Go, 100 lines. Six commits, each adding one isolation feature. Brilliant pacing.
  • Lizzie Dixon's "Linux containers in 500 lines of code" — C, more thorough. Includes overlay filesystems and seccomp.

For this degree: Liz Rice's Go path first (1 weekend), then Dixon's C version if you want depth.

The destination is a runtime that meets a small subset of the OCI Runtime Specification — the actual industry standard.
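To make "a small subset" concrete, here is a hedged sketch of reading a few fields from an OCI config.json. The JSON field names (ociVersion, process.args, process.env, process.cwd, root.path, root.readonly, hostname) come from the runtime spec; the Go type names, the parseSpec helper, the sample document, and the validation rules are choices made for this illustration and are not part of any official library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Spec mirrors a hand-picked subset of the OCI runtime config.json.
type Spec struct {
	Version  string  `json:"ociVersion"`
	Process  Process `json:"process"`
	Root     Root    `json:"root"`
	Hostname string  `json:"hostname,omitempty"`
}

// Process describes the program to run inside the container.
type Process struct {
	Args []string `json:"args"`
	Env  []string `json:"env,omitempty"`
	Cwd  string   `json:"cwd"`
}

// Root points at the container's root filesystem.
type Root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly,omitempty"`
}

// parseSpec (hypothetical helper) decodes config.json and applies the
// minimal checks this toy runtime needs; the spec itself is more lenient.
func parseSpec(data []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parsing config.json: %w", err)
	}
	if len(s.Process.Args) == 0 {
		return nil, fmt.Errorf("config.json: process.args is empty")
	}
	if s.Root.Path == "" {
		return nil, fmt.Errorf("config.json: root.path is empty")
	}
	return &s, nil
}

// sample is an illustrative document, not taken from a real bundle.
const sample = `{
  "ociVersion": "1.0.2",
  "process": {"args": ["/bin/sh"], "cwd": "/", "env": ["PATH=/usr/bin:/bin"]},
  "root": {"path": "rootfs"},
  "hostname": "container"
}`

func main() {
	data := []byte(sample)
	if len(os.Args) > 1 { // optionally read a real config.json instead
		var err error
		if data, err = os.ReadFile(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	spec, err := parseSpec(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("would run %q in rootfs %q with hostname %q\n",
		spec.Process.Args, spec.Root.Path, spec.Hostname)
}
```

Each field lines up with a milestone: root.path feeds the pivot_root step, hostname feeds the UTS step, and process.args is what the child finally executes.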


7. Implementation milestones (following Liz Rice's structure)

Milestone 1: fork + exec (no isolation)

A program that runs another program. This is your shell, basically.

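package main

import (
	"os"
	"os/exec"
)

func main() {
	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	}
}

// run re-executes this binary with "child" as its first argument.
func run() {
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must(cmd.Run())
}

// child runs the requested program with its arguments.
func child() {
	cmd := exec.Command(os.Args[2], os.Args[3:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	must(cmd.Run())
}

// must panics on any error; acceptable for a learning project.
func must(err error) {
	if err != nil {
		panic(err)
	}
}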

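The run → re-exec self with child argument is the standard Go pattern. Go's multithreaded runtime makes a bare fork() or clone() unsafe, so namespace flags can only be applied to a newly started process (through SysProcAttr). The re-exec'd child then does the in-container setup before it runs the target program.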

Evidence: mycontainer run /bin/bash opens a shell. No isolation yet.

Milestone 2: UTS namespace (hostname isolation)

Set CLONE_NEWUTS so the container has its own hostname.

// In run(), before cmd.Run() (also add "syscall" to the imports):
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS,
}

In the child: syscall.Sethostname([]byte("container")).

Evidence: mycontainer run /bin/bash; hostname shows container, but the host's hostname is unchanged.

Milestone 3: PID namespace

Add CLONE_NEWPID. Inside the container, the first process is PID 1.

After this, ps inside the container still shows the host's processes because /proc is shared. Fix that in the next milestone.

Evidence: echo $$ inside the container shows a small number. The re-exec'd child helper is PID 1, and the shell it launches gets one of the next few PIDs.

Milestone 4: Mount namespace + chroot

Add CLONE_NEWNS. Mount a private root filesystem (an extracted tarball, e.g., alpine-minirootfs.tar.gz from alpinelinux.org). chroot (or better, pivot_root) into it. Mount /proc so ps works.

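// Make every mount private first: pivot_root fails if the parent mount is shared (the systemd default).
must(syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""))
// pivot_root requires new_root to be a mount point, so bind-mount rootfs onto itself.
must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND|syscall.MS_REC, ""))
must(os.MkdirAll("rootfs/oldrootfs", 0700)) // put_old must exist underneath new_root
must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
must(os.Chdir("/"))
must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH)) // drop the host's old root
must(syscall.Mount("proc", "/proc", "proc", 0, ""))     // fresh /proc for this PID namespace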

Evidence: Inside the container, ls / shows Alpine's filesystem, not your host's. ps shows only container processes.

Milestone 5: cgroups (memory limit)

Create a cgroup, write the limit, add the child PID to it.

mkdir /sys/fs/cgroup/memory/mycontainer
echo 100M > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
echo $PID > /sys/fs/cgroup/memory/mycontainer/tasks

In Go (cgroup v1):

must(os.MkdirAll("/sys/fs/cgroup/memory/mycontainer", 0755))
must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes", []byte("100M"), 0644))
must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/tasks", []byte(strconv.Itoa(os.Getpid())), 0644))

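On modern systems, use cgroup v2 (unified hierarchy) instead: the cgroup directory sits directly under /sys/fs/cgroup, the limit goes in memory.max, and PIDs are written to cgroup.procs.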
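A cgroup v2 version of the same three steps could be sketched as follows. The file names memory.max and cgroup.procs are the real v2 interface; the cgroup name "mycontainer", the helper names, and the default "print only" mode are assumptions made for this example. It also assumes the memory controller is enabled for child cgroups (listed in the parent's cgroup.subtree_control), otherwise memory.max will not exist.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// cgroupRoot is the usual cgroup v2 mount point.
const cgroupRoot = "/sys/fs/cgroup"

// write is one "put this value in this file" step.
type write struct{ path, value string }

// memoryLimitWrites (hypothetical helper) lists, in order, the writes that
// cap memory for the named cgroup and then move pid into it.
func memoryLimitWrites(root, name string, limitBytes int64, pid int) []write {
	dir := filepath.Join(root, name)
	return []write{
		{filepath.Join(dir, "memory.max"), strconv.FormatInt(limitBytes, 10)},
		{filepath.Join(dir, "cgroup.procs"), strconv.Itoa(pid)},
	}
}

// applyMemoryLimit creates the cgroup directory and performs the writes.
// It needs root, or a cgroup subtree delegated to the current user.
func applyMemoryLimit(root, name string, limitBytes int64, pid int) error {
	if err := os.MkdirAll(filepath.Join(root, name), 0o755); err != nil {
		return err
	}
	for _, w := range memoryLimitWrites(root, name, limitBytes, pid) {
		if err := os.WriteFile(w.path, []byte(w.value), 0o644); err != nil {
			return fmt.Errorf("writing %s: %w", w.path, err)
		}
	}
	return nil
}

func main() {
	const limit = 100 << 20 // 100 MiB, matching the v1 example
	if len(os.Args) > 1 && os.Args[1] == "apply" {
		if err := applyMemoryLimit(cgroupRoot, "mycontainer", limit, os.Getpid()); err != nil {
			fmt.Fprintln(os.Stderr, "error:", err)
			os.Exit(1)
		}
		fmt.Println("limit applied to PID", os.Getpid())
		return
	}
	// Default: only show what would be written.
	for _, w := range memoryLimitWrites(cgroupRoot, "mycontainer", limit, os.Getpid()) {
		fmt.Printf("%s <- %s\n", w.path, w.value)
	}
}
```

In the runtime itself, the child would call something like applyMemoryLimit on its own PID before starting the target program, so that the program inherits the cgroup.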

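Evidence: Inside the container, run a program that allocates and actually writes to 200 MB. The OOM killer should kill it once it crosses the 100 MB limit. If swap is enabled, the kernel may swap the process out instead of killing it.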
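A test program for this evidence step might look like the sketch below. Writing to each page matters: memory that is reserved but never written is typically not charged to the cgroup. The 4096-byte page size and the helper name allocate are assumptions of this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const mib = 1 << 20

// allocate (hypothetical helper) reserves nMiB one-MiB chunks and writes one
// byte per assumed 4096-byte page so the memory becomes resident. The chunks
// are returned so the caller can keep them alive.
func allocate(nMiB int) [][]byte {
	chunks := make([][]byte, 0, nMiB)
	for i := 0; i < nMiB; i++ {
		chunk := make([]byte, mib)
		for off := 0; off < len(chunk); off += 4096 {
			chunk[off] = 1
		}
		chunks = append(chunks, chunk)
	}
	return chunks
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: hog <MiB>")
		return
	}
	total, err := strconv.Atoi(os.Args[1])
	if err != nil || total < 0 {
		fmt.Fprintln(os.Stderr, "hog: argument must be a non-negative integer")
		os.Exit(2)
	}
	var held [][]byte
	const step = 10 // report progress every 10 MiB
	for done := 0; done < total; {
		n := step
		if total-done < n {
			n = total - done
		}
		held = append(held, allocate(n)...)
		done += n
		fmt.Printf("touched %d MiB\n", done)
	}
	fmt.Printf("finished holding %d MiB\n", len(held))
}
```

Build it statically (for example with CGO_ENABLED=0) so it runs inside the Alpine rootfs, then run it with 200 as the argument. Under a 100 MB limit the progress lines should stop partway and the process should be killed.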

Milestone 6: Network namespace

Add CLONE_NEWNET. The container has its own network stack — initially empty. Add a veth pair if you want it to talk to the outside.

# $PID is the host-side PID of a process running inside the container
ip link add veth0 type veth peer name veth1
ip link set veth1 netns $PID
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
nsenter -t $PID -n ip addr add 10.0.0.2/24 dev veth1
nsenter -t $PID -n ip link set veth1 up
nsenter -t $PID -n ip link set lo up

Evidence: By default the container sees only its own loopback interface, which starts out down. After the veth setup, it can ping the host at 10.0.0.1.

Milestone 7 (optional): Overlay filesystem (image layers)

Use overlayfs to stack a writable layer on top of a read-only base layer. This is how Docker images work.

mkdir lower upper work merged
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged

Evidence: Multiple containers share the same base image (lower), but each has its own writes (upper).
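The same mount can be issued from Go with syscall.Mount, which keeps the runtime free of shell calls. The following is a sketch, assuming Linux with overlayfs available and root privileges; overlayOptions and mountOverlay are hypothetical helpers. Several read-only layers are joined with ":" with the top-most layer first, and paths containing "," or ":" are not handled.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// overlayOptions (hypothetical helper) builds the option string for an
// overlay mount. lowers is ordered top-most layer first.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

// mountOverlay creates the writable directories if needed and mounts the
// merged view. upper and work must live on the same filesystem.
func mountOverlay(lowers []string, upper, work, merged string) error {
	for _, dir := range []string{upper, work, merged} {
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return err
		}
	}
	return syscall.Mount("overlay", merged, "overlay", 0,
		overlayOptions(lowers, upper, work))
}

func main() {
	if len(os.Args) < 5 {
		fmt.Fprintln(os.Stderr, "usage: overlay <merged> <upper> <work> <lower>...")
		fmt.Fprintln(os.Stderr, "example options:",
			overlayOptions([]string{"lower"}, "upper", "work"))
		return
	}
	merged, upper, work, lowers := os.Args[1], os.Args[2], os.Args[3], os.Args[4:]
	if err := mountOverlay(lowers, upper, work, merged); err != nil {
		fmt.Fprintln(os.Stderr, "mount overlay:", err)
		os.Exit(1)
	}
	fmt.Println("mounted overlay at", merged)
}
```

To share one base image between containers, give each container its own upper, work, and merged directories while passing the same lower directory to all of them.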

Milestone 8 (optional): seccomp filter

Restrict system calls. Block dangerous ones (reboot, kexec_load, mount, ...).

Milestone 9 (optional): User namespace

Map root inside the container to an unprivileged user outside. This is what "rootless containers" use.


8. Tests & evidence

Test                   How
Hostname isolation     Container can change hostname; host unchanged
PID isolation          PID 1 inside; host PIDs invisible
Filesystem isolation   ls / shows different content inside vs outside
Memory limit           OOM kill triggers on attempted over-allocation
Network isolation      Container starts with only loopback; veth gives it access
Process lifecycle      Container exit cleans up cgroups and namespaces
Interop                A Docker image's rootfs can be used as your container's filesystem

The strongest single evidence: a side-by-side terminal recording showing the container is isolated from the host.


9. Common pitfalls

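  • Re-exec required from Go. Go's multithreaded runtime makes a bare fork() or clone() unsafe, so namespace flags can only be set on a newly started process through SysProcAttr. The re-exec trick (call yourself with a different argument) is the standard workaround for running setup code inside the new namespaces.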
  • pivot_root is finicky. Order matters. Read man 2 pivot_root carefully.
  • /proc confusion. Inside the container, /proc must be a fresh mount, not the host's.
  • cgroup v1 vs v2. Different APIs and paths. Modern systems (RHEL 9, Ubuntu 22+) default to v2.
  • Capabilities. A container running as "root" inside still has capabilities from the host's perspective unless you drop them.
  • Mounts leaking back to the host. Use MS_PRIVATE propagation to prevent your container's mounts from polluting the host's mount table.
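  • Forgetting to unshare the user namespace. Without it, "root inside" really is root, with all its danger.
  • PID 1 special behavior. The kernel does not apply default signal actions to PID 1: a signal is delivered only if the process has installed a handler for it. An init process that does not handle SIGTERM simply ignores it, so the container may hang on shutdown. PID 1 is also responsible for reaping orphaned children.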

10. Extensions

  • Container image format. Read a Docker image tarball (docker save) and use it as your container's filesystem.
  • OCI spec compliance. Implement enough of the runc command-line interface and config.json handling that your runtime can run bundles that runc and crun accept.
  • Networking via CNI. The Container Network Interface plugin model.
  • Rootless mode. Drop the need for sudo. User namespaces are the trick.
  • Layered storage. Multiple overlay layers; image building.
  • Image building. A toy version of docker build from a Dockerfile.

11. Module integration

Module                                          What the container deepens
Sem 5 Module 1: Processes & scheduling          clone() is fork() with flags. Each namespace is a "child" of the global resource.
Sem 5 Module 3: Concurrency                     Namespaces are concurrent views of shared kernel state.
Sem 5 Module 4: File systems & I/O              Overlay, bind mounts, chroot.
Shell tutorial                                  Container = shell + isolation flags.
Sem 9 (Production phase): Cloud / Kubernetes    Kubernetes orchestrates these primitives at scale. Knowing the foundation makes K8s much less mysterious.

12. Portfolio framing

What to publish:

  • Source (main.go, cmd_run.go, cmd_child.go, cgroups.go, network.go).
  • A README with a side-by-side terminal recording showing isolation.
  • A list of "what Docker has that this doesn't": layers, image registry, networking plugins, security policies.
  • A list of OCI features you implemented vs skipped.

What to keep private:

  • None — this is portfolio-grade. But be honest about security: a toy container runtime is not a security boundary. State this loudly.

Reviewer entry points:

  • cmd_run.go — the entry point. The Cloneflags line is the heart of it.
  • cgroups.go — resource limits.
  • README must include: a video/GIF of isolation demonstration; security caveats; reference to Liz Rice or Dixon as your starting point.

A 200-line container runtime is a striking portfolio piece because everyone uses Docker and almost no one understands it.


Source

This tutorial draws from the BYO-X catalog "Docker" section. Liz Rice's GopherCon talk and Lizzie Dixon's 500-line C version are the canonical primary tutorials.