Build Your Own Docker / Container Runtime
"A container is just a process, with some flags."
Containers feel like a major piece of infrastructure. They are. But the underlying primitives — Linux namespaces, cgroups, chroot, capabilities — are accessible from C, Go, or Python in under 200 lines. Building a tiny container runtime is the single best way to demystify Docker.
1. Overview & motivation
A "container" is a process that has been isolated using a few Linux kernel features:
- Namespaces — separate views of the system: PID, mount, UTS (hostname), IPC, net, user.
- cgroups — resource limits: CPU, memory, I/O.
- chroot (or pivot_root) — restricted view of the filesystem.
- capabilities — drop privileges.
- seccomp — restrict system calls.
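A quick way to see the namespace primitive before building anything: every process exposes its namespaces as symlinks under /proc/self/ns, and two processes share a namespace exactly when the link targets match. The sketch below lists them for the current process. The helper name parseNSLink is an illustration for this tutorial, not part of any library.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNSLink splits a /proc/PID/ns symlink target such as
// "pid:[4026531836]" into the namespace type and its inode number.
func parseNSLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	links, err := filepath.Glob("/proc/self/ns/*")
	if err != nil || len(links) == 0 {
		fmt.Println("no /proc/self/ns entries found (is this Linux?)")
		return
	}
	for _, link := range links {
		target, err := os.Readlink(link)
		if err != nil {
			continue // some entries may be unreadable; skip them
		}
		if kind, inode, err := parseNSLink(target); err == nil {
			fmt.Printf("%-18s inode %d\n", kind, inode)
		}
	}
}
```

Run it once on the host and once inside your finished container: the inode numbers for the namespaces you unshared should differ.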
What you can only learn by building one:
- Why containers are not virtual machines — they share the host kernel, which is both their power and their security limitation.
- Why clone() with namespace flags is the primitive Docker is built on.
- Why overlay filesystems are how Docker images stack layers.
- Why container security is a constant battle (escape paths exist; cgroup misconfigurations matter).
2. Where this fits in the degree
- Phase: Systems
- Semester: 5 (OS and Networking)
- Modules deepened: Module 1 (processes: clone() is fork() with knobs). Module 3 (concurrency: namespaces are concurrency on global resources). Module 4 (file systems: overlay FS, chroot).
Cross-phase relevance:
- Direct background for cloud/DevOps work in Sem 9 (Kubernetes manages containers).
- Builds on the Shell tutorial (containers wrap a fork/exec).
3. Prerequisites
- Complete the Shell tutorial first. You need to be comfortable with fork/exec/wait.
- Linux. The tutorial is Linux-only. Container primitives are Linux kernel features.
- Root access (or capabilities). Most operations require it.
- C or Go. Most of the BYO-X catalog uses one of these.
4. Theory & research
Required reading
- Liz Rice, "Containers from Scratch" (youtube.com/watch?v=8fi7uSYlOdc) — 30-minute live-coded tutorial. ⭐ start here.
- Fewbytes, "A workshop on Linux containers", github.com/Fewbytes/rubber-docker. Python workshop, six exercises building toward a runtime.
- Lizzie Dixon, "Linux containers in 500 lines of code" — blog.lizzie.io/linux-containers-in-500-loc.html. C. The single most thorough tutorial. ⭐ recommended primary.
Strongly recommended
- Michael Kerrisk, The Linux Programming Interface: Chapter 28 (creating processes via clone()), Section 28.2.1 (Linux-specific clone() flags).
- Linux man pages: man 7 namespaces, man 7 capabilities, man 7 cgroups.
- OCI Runtime Specification: github.com/opencontainers/runtime-spec. The standard interface that Docker, podman, containerd all implement.
For depth
- runc source code — github.com/opencontainers/runc. The actual reference OCI runtime. Go.
- Aleksa Sarai's blog — cyphar.com. Definitive writing on container internals.
5. Curated tutorial list (from BYO-X)
- C: Linux containers in 500 lines of code — Lizzie Dixon, blog.lizzie.io ⭐ recommended primary
- Go: Build Your Own Container Using Less than 100 Lines of Go, Julian Friedman's InfoQ article
- Go: Building a container from scratch in Go [video] — Liz Rice, GopherCon 2018 ⭐ best video
- Python: A workshop on Linux containers: Rebuild Docker from Scratch — rubber-docker workshop
- Python: A proof-of-concept imitation of Docker, written in 100% Python (tonybaloney/mocker)
- Shell: Docker implemented in around 100 lines of bash — p8952/bocker
6. Recommended primary path
Two excellent starting points; pick by language preference:
- Liz Rice's video + repo (containers-from-scratch) — Go, 100 lines. Six commits, each adding one isolation feature. Brilliant pacing.
- Lizzie Dixon's "Linux containers in 500 lines of code" — C, more thorough. Includes overlay filesystems and seccomp.
For this degree: Liz Rice's Go path first (1 weekend), then Dixon's C version if you want depth.
The destination is a runtime that meets a small subset of the OCI Runtime Specification — the actual industry standard.
7. Implementation milestones (following Liz Rice's structure)
Milestone 1: fork + exec (no isolation)
A program that runs another program. This is your shell, basically.
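    package main
    import (
        "os"
        "os/exec"
    )

    func main() {
        switch os.Args[1] {
        case "run":
            run()
        case "child":
            child()
        }
    }

    func run() { // parent: re-execute this binary with "child" prepended
        cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        must(cmd.Run())
    }

    func child() { // child: run the requested program
        cmd := exec.Command(os.Args[2], os.Args[3:]...)
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        must(cmd.Run())
    }

    func must(err error) { // panic on any error; enough for a tutorial
        if err != nil {
            panic(err)
        }
    }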
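Re-executing /proc/self/exe with a child argument is the standard Go pattern. Go applies clone() flags only when it starts a new process (through SysProcAttr), so any setup that has to run inside the new namespaces lives in the re-executed child.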
Evidence: mycontainer run /bin/bash opens a shell. No isolation yet.
Milestone 2: UTS namespace (hostname isolation)
Set CLONE_NEWUTS so the container has its own hostname.
    cmd.SysProcAttr = &syscall.SysProcAttr{
        Cloneflags: syscall.CLONE_NEWUTS, // new UTS namespace: private hostname
    }
In the child: syscall.Sethostname([]byte("container")).
Evidence: mycontainer run /bin/bash; hostname shows container, but the host's hostname is unchanged.
Milestone 3: PID namespace
Add CLONE_NEWPID. Inside the container, the first process is PID 1.
After this, ps inside the container still shows the host's processes because /proc is shared. Fix that in the next milestone.
Evidence: echo $$ inside the container shows a low number instead of a host-sized PID. The re-executed child is PID 1, and the shell is its descendant.
Milestone 4: Mount namespace + chroot
Add CLONE_NEWNS. Mount a private root filesystem (an extracted tarball, e.g., alpine-minirootfs.tar.gz from alpinelinux.org). chroot (or better, pivot_root) into it. Mount /proc so ps works.
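    // pivot_root fails if the current root is a shared mount, so make it private first.
    must(syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""))
    // new_root must be a mount point: bind-mount rootfs onto itself.
    must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND, ""))
    must(os.MkdirAll("rootfs/oldrootfs", 0700)) // put_old must exist under the new root
    must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
    must(os.Chdir("/"))
    must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH)) // detach the host's root
    must(syscall.Mount("proc", "/proc", "proc", 0, ""))     // fresh /proc for this PID namespace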
Evidence: Inside the container, ls / shows Alpine's filesystem, not your host's. ps shows only container processes.
Milestone 5: cgroups (memory limit)
Create a cgroup, write the limit, add the child PID to it.
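    mkdir /sys/fs/cgroup/memory/mycontainer
    echo 100M > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
    echo $PID > /sys/fs/cgroup/memory/mycontainer/tasks
In Go (cgroup v1), called from child() so that the container process and everything it starts land in the cgroup:
    must(os.MkdirAll("/sys/fs/cgroup/memory/mycontainer", 0755))
    must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes", []byte("100M"), 0644))
    // os.Getpid() is the calling process; needs the strconv import.
    must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/tasks", []byte(strconv.Itoa(os.Getpid())), 0644))
On modern systems use cgroup v2 (unified hierarchy): the directory is /sys/fs/cgroup/mycontainer, the limit goes in memory.max, and PIDs go in cgroup.procs.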
Evidence: Run a program that allocates and touches 200 MB. The OOM killer kills it (with swap enabled, the process may be swapped out instead).
Milestone 6: Network namespace
Add CLONE_NEWNET. The container has its own network stack — initially empty. Add a veth pair if you want it to talk to the outside.
    ip link add veth0 type veth peer name veth1
    ip link set veth1 netns $PID                          # move one end into the container
    ip addr add 10.0.0.1/24 dev veth0
    ip link set veth0 up
    nsenter -t $PID -n ip addr add 10.0.0.2/24 dev veth1  # run ip inside the container's namespace
    nsenter -t $PID -n ip link set veth1 up
Evidence: By default the container has only a loopback interface, and it starts down. After veth setup, it can ping the host at 10.0.0.1.
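If you would rather drive the veth setup from the runtime than from a shell, one option is to build the command list in Go and run it with os/exec. This is a sketch under assumptions: the function names are hypothetical, the interface names and addresses are the illustrative ones from this milestone, and it relies on nsenter (from util-linux) to run ip inside the container's network namespace by PID.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
)

// vethCommands returns the commands that connect the network namespace of
// pid to the host through a veth pair.
func vethCommands(pid int) [][]string {
	p := strconv.Itoa(pid)
	// inNS prefixes a command so that it runs inside pid's network namespace.
	inNS := func(args ...string) []string {
		return append([]string{"nsenter", "-t", p, "-n"}, args...)
	}
	return [][]string{
		{"ip", "link", "add", "veth0", "type", "veth", "peer", "name", "veth1"},
		{"ip", "link", "set", "veth1", "netns", p},
		{"ip", "addr", "add", "10.0.0.1/24", "dev", "veth0"},
		{"ip", "link", "set", "veth0", "up"},
		inNS("ip", "addr", "add", "10.0.0.2/24", "dev", "veth1"),
		inNS("ip", "link", "set", "veth1", "up"),
		inNS("ip", "link", "set", "lo", "up"),
	}
}

// setupVeth runs the commands in order and stops at the first failure.
// It needs root and a live container PID.
func setupVeth(pid int) error {
	for _, argv := range vethCommands(pid) {
		if out, err := exec.Command(argv[0], argv[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %w: %s", argv, err, out)
		}
	}
	return nil
}

func main() {
	// Print the plan instead of executing it; 12345 is a placeholder PID.
	for _, argv := range vethCommands(12345) {
		fmt.Println(argv)
	}
}
```

The parent would call setupVeth(cmd.Process.Pid) after cmd.Start() and before cmd.Wait(), since the namespace only exists once the child is running.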
Milestone 7 (optional): Overlay filesystem (image layers)
Use overlayfs to stack a writable layer on top of a read-only base layer. This is how Docker images work.
    mkdir lower upper work merged
    mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
Evidence: Multiple containers share the same base image (lower), but each has its own writes (upper).
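The same mount can be issued from Go with syscall.Mount. The sketch below separates building the option string from the mount call. Both function names are assumptions for illustration. When several read-only layers are stacked, overlayfs expects them separated by colons with the top-most layer first.

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// overlayOptions builds the data string for an overlay mount.
// lowers are listed top-most layer first.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

// mountOverlay is the syscall form of
// "mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... merged".
// It needs root (or a user namespace) and all directories must exist.
func mountOverlay(lowers []string, upper, work, merged string) error {
	return syscall.Mount("overlay", merged, "overlay", 0,
		overlayOptions(lowers, upper, work))
}

func main() {
	// Only print the options here; the mount itself needs privileges.
	fmt.Println(overlayOptions([]string{"layer2", "layer1"}, "upper", "work"))
	// prints: lowerdir=layer2:layer1,upperdir=upper,workdir=work
}
```

Giving each container its own upper and work directories over a shared lower list is what lets containers share a base image without seeing each other's writes.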
Milestone 8 (optional): seccomp filter
Restrict system calls. Block dangerous ones (reboot, kexec_load, mount, ...).
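A seccomp filter is a small classic-BPF program that the kernel runs on every system call. The sketch below only builds such a program for a deny list and prints it. Installing it (prctl with PR_SET_NO_NEW_PRIVS, then PR_SET_SECCOMP in filter mode, on the thread that will exec) is left out, and real runtimes usually rely on libseccomp instead of hand-written BPF. The function name denyFilter is hypothetical, the numeric constants are copied from the kernel headers because Go's syscall package does not export them, and the architecture value assumes an x86-64 host. It also does not handle the x32 ABI, which a production filter on x86-64 must.

```go
package main

import (
	"fmt"
	"syscall"
)

// Values from linux/bpf_common.h and linux/seccomp.h.
const (
	bpfLoadWordAbs = 0x20 // BPF_LD | BPF_W | BPF_ABS
	bpfJumpEqual   = 0x15 // BPF_JMP | BPF_JEQ | BPF_K
	bpfReturn      = 0x06 // BPF_RET | BPF_K

	seccompRetAllow = 0x7fff0000 // SECCOMP_RET_ALLOW
	seccompRetErrno = 0x00050000 // SECCOMP_RET_ERRNO; low 16 bits carry the errno
	seccompRetKill  = 0x80000000 // SECCOMP_RET_KILL_PROCESS

	auditArchX8664 = 0xC000003E // AUDIT_ARCH_X86_64 (assumption: amd64 host)
)

// denyFilter builds a program that makes the listed syscalls fail with
// EPERM and allows everything else. A process running under a different
// architecture than arch is killed. At most 255 syscalls can be listed,
// because BPF jump offsets are 8 bits wide.
func denyFilter(arch uint32, denied []uint32) []syscall.SockFilter {
	prog := []syscall.SockFilter{
		{Code: bpfLoadWordAbs, K: 4},         // load seccomp_data.arch
		{Code: bpfJumpEqual, K: arch, Jt: 1}, // expected arch: skip the kill
		{Code: bpfReturn, K: seccompRetKill},
		{Code: bpfLoadWordAbs, K: 0}, // load seccomp_data.nr
	}
	for i, nr := range denied {
		// On a match, skip the remaining comparisons and the ALLOW return.
		prog = append(prog, syscall.SockFilter{
			Code: bpfJumpEqual, K: nr, Jt: uint8(len(denied) - i),
		})
	}
	return append(prog,
		syscall.SockFilter{Code: bpfReturn, K: seccompRetAllow},
		syscall.SockFilter{Code: bpfReturn, K: seccompRetErrno | uint32(syscall.EPERM)},
	)
}

func main() {
	prog := denyFilter(auditArchX8664, []uint32{
		syscall.SYS_REBOOT, syscall.SYS_KEXEC_LOAD, syscall.SYS_MOUNT,
	})
	for i, ins := range prog {
		fmt.Printf("%2d: code=%#04x jt=%d jf=%d k=%#x\n", i, ins.Code, ins.Jt, ins.Jf, ins.K)
	}
}
```

A deny list like this is easy to read but fragile; an allow list that returns an error for every syscall not explicitly permitted is the safer design once the basics work.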
Milestone 9 (optional): User namespace
Map root inside the container to an unprivileged user outside. This is what "rootless containers" use.
8. Tests & evidence
| Test | How |
|---|---|
| Hostname isolation | Container can change hostname; host unchanged |
| PID isolation | PID 1 inside; host PIDs invisible |
| Filesystem isolation | ls / shows different content inside vs outside |
| Memory limit | OOM kill triggers on attempted over-allocation |
| Network isolation | Container starts with only loopback; veth gives it access |
| Process lifecycle | Container exit cleans up cgroups and namespaces |
| Interop | A Docker image's rootfs can be used as your container's filesystem |
The strongest single evidence: a side-by-side terminal recording showing the container is isolated from the host.
9. Common pitfalls
- Re-exec required from Go. Go can only apply clone() namespace flags when it starts a new process (SysProcAttr.Cloneflags), so setup code that must run inside the new namespaces needs its own process. The re-exec trick (call yourself with a different argument) is the standard workaround.
- pivot_root is finicky. Order matters. Read man 2 pivot_root carefully.
- /proc confusion. Inside the container, /proc must be a fresh mount, not the host's.
- cgroup v1 vs v2. Different APIs and paths. Modern systems (RHEL 9, Ubuntu 22.04+) default to v2.
- Capabilities. Without a user namespace, "root" inside the container holds the full host capability set (CAP_SYS_ADMIN, CAP_NET_ADMIN, and so on) unless you drop them.
- Mounts leaking back to the host. Use MS_PRIVATE propagation to prevent your container's mounts from polluting the host's mount table.
- Forgetting to unshare user namespace. Without it, "root inside" really is root, with all its danger.
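- PID 1 special behavior. The kernel does not deliver a signal to PID 1 from inside its namespace unless PID 1 has installed a handler for it, so default actions such as terminating on SIGTERM never fire. If your init process doesn't handle signals, the container may hang on shutdown.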
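The capabilities pitfall is easy to check from inside the container, because the kernel publishes each process's effective capability mask in /proc/PID/status. The sketch below parses that mask and tests one bit. The helper names are assumptions; the bit number 21 for CAP_SYS_ADMIN comes from linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

const capSysAdmin = 21 // CAP_SYS_ADMIN

// parseCapEff extracts the effective capability mask from the text of
// /proc/PID/status, where it appears as a line like "CapEff:\t000001ffffffffff".
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "CapEff:") {
			hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
			return strconv.ParseUint(hex, 16, 64)
		}
	}
	return 0, fmt.Errorf("no CapEff line found")
}

// hasCap reports whether capability bit n is set in mask.
func hasCap(mask uint64, n uint) bool {
	return mask&(uint64(1)<<n) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("CapEff=%#x CAP_SYS_ADMIN=%v\n", mask, hasCap(mask, capSysAdmin))
}
```

Run as root inside a container that has not dropped anything, this should report CAP_SYS_ADMIN as present, which is the condition the pitfall warns about.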
10. Extensions
- Container image format. Read a Docker image tarball (docker save) and use it as your container's filesystem.
- OCI spec compliance. Implement enough of the runtime spec that a bundle which runs under runc or crun also runs under your runtime.
- Networking via CNI. The Container Network Interface plugin model.
- Rootless mode. Drop the need for sudo. User namespaces are the trick.
- Layered storage. Multiple overlay layers; image building.
- Image building. A toy version of docker build from a Dockerfile.
11. Module integration
| Module | What the container deepens |
|---|---|
| Sem 5 Module 1 — Processes & scheduling | clone() is fork() with flags. Each namespace is a "child" of the global resource. |
| Sem 5 Module 3 — Concurrency | Namespaces are concurrent views of shared kernel state. |
| Sem 5 Module 4 — File systems & I/O | Overlay, bind mounts, chroot. |
| Shell tutorial | Container = shell + isolation flags. |
| Sem 9 (Production phase) — Cloud / Kubernetes | Kubernetes orchestrates these primitives at scale. Knowing the foundation makes K8s much less mysterious. |
12. Portfolio framing
What to publish:
- Source (main.go, cmd_run.go, cmd_child.go, cgroups.go, network.go).
- A README with a side-by-side terminal recording showing isolation.
- A list of "what Docker has that this doesn't": layers, image registry, networking plugins, security policies.
- A list of OCI features you implemented vs skipped.
What to keep private:
- None — this is portfolio-grade. But be honest about security: a toy container runtime is not a security boundary. State this loudly.
Reviewer entry points:
- cmd_run.go: the entry point. The Cloneflags line is the heart of it.
- cgroups.go: resource limits.
- README must include: a video/GIF of isolation demonstration; security caveats; reference to Liz Rice or Dixon as your starting point.
A 200-line container runtime is a striking portfolio piece because everyone uses Docker and almost no one understands it.
Source
This tutorial draws from the BYO-X catalog "Docker" section. Liz Rice's GopherCon talk and Lizzie Dixon's 500-line C version are the canonical primary tutorials.