Build Your Own Docker / Container Runtime

"A container is just a process, with some flags."

Containers feel like a major piece of infrastructure. They are. But the underlying primitives — Linux namespaces, cgroups, chroot, capabilities — are accessible from C, Go, or Python in under 200 lines. Building a tiny container runtime is the single best way to demystify Docker.

1. Overview & motivation

A "container" is a process that has been isolated using a few Linux kernel features:

Namespaces — separate views of the system: PID, mount, UTS (hostname), IPC, net, user.
cgroups — resource limits: CPU, memory, I/O.
chroot (or pivot_root) — restricted view of the filesystem.
capabilities — drop privileges.
seccomp — restrict system calls.
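
One way to make namespaces tangible before building anything is to look at the ones your current process already belongs to. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name nsID is made up for this example.

```go
package main

import (
	"fmt"
	"os"
)

// nsID returns the kernel's identifier for one namespace of the current
// process, for example "uts:[4026531838]". Processes that print the same
// identifier for a given kind are in the same namespace.
func nsID(kind string) (string, error) {
	return os.Readlink("/proc/self/ns/" + kind)
}

func main() {
	for _, kind := range []string{"pid", "mnt", "uts", "ipc", "net", "user"} {
		id, err := nsID(kind)
		if err != nil {
			fmt.Fprintln(os.Stderr, "cannot read namespace:", err)
			continue
		}
		fmt.Println(id)
	}
}
```

Run it once on the host and once inside your finished container: every namespace you isolated should print a different number.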

What you can only learn by building one:

Why containers are not virtual machines — they share the host kernel, which is both their power and their security limitation.
Why clone() with namespace flags is the primitive Docker is built on.
Why overlay filesystems are how Docker images stack layers.
Why container security is a constant battle (escape paths exist; cgroup misconfigurations matter).

2. Where this fits in the degree

Phase: Systems
Semester: 5 (OS and Networking)
Modules deepened: Module 1 (processes): clone() is fork() with knobs. Module 3 (concurrency): namespaces give processes separate views of shared kernel state. Module 4 (file systems): overlay FS, chroot.

Cross-phase relevance:

Direct background for cloud/DevOps work in Sem 9 (Kubernetes manages containers).
Builds on the Shell tutorial (containers wrap a fork/exec).

3. Prerequisites

Complete the Shell tutorial first — you need to be comfortable with fork/exec/wait.
Linux. The tutorial is Linux-only. Container primitives are Linux kernel features.
Root access (or capabilities). Most operations require it.
C or Go. Most of the BYO-X catalog uses one of these.

4. Theory & research

Required reading

Liz Rice, "Containers from Scratch" (youtube.com/watch?v=8fi7uSYlOdc) — 30-minute live-coded tutorial. ⭐ start here.
Fewbytes, "A workshop on Linux containers" (github.com/Fewbytes/rubber-docker). Python workshop, six exercises building toward a runtime.
Lizzie Dixon, "Linux containers in 500 lines of code" — blog.lizzie.io/linux-containers-in-500-loc.html. C. The single most thorough tutorial. ⭐ recommended primary.

Strongly recommended

Michael Kerrisk, The Linux Programming Interface — Chapter 28 (creating processes via clone()), Section 28.2.1 (Linux-specific clone() flags).
Linux man pages — man 7 namespaces, man 7 capabilities, man 7 cgroups.
OCI Runtime Specification — github.com/opencontainers/runtime-spec. The standard interface that Docker, podman, containerd all implement.

For depth

runc source code — github.com/opencontainers/runc. The actual reference OCI runtime. Go.
Aleksa Sarai's blog — cyphar.com. Definitive writing on container internals.

5. Curated tutorial list (from BYO-X)

C: Linux containers in 500 lines of code — Lizzie Dixon, blog.lizzie.io ⭐ recommended primary
Go: Build Your Own Container Using Less than 100 Lines of Go (Julian Friedman, InfoQ article)
Go: Building a container from scratch in Go [video] — Liz Rice, GopherCon 2018 ⭐ best video
Python: A workshop on Linux containers: Rebuild Docker from Scratch — rubber-docker workshop
Python: A proof-of-concept imitation of Docker, written in 100% Python (tonybaloney/mocker)
Shell: Docker implemented in around 100 lines of bash — p8952/bocker

6. Recommended primary path

Two excellent starting points; pick by language preference:

Liz Rice's video + repo (containers-from-scratch) — Go, 100 lines. Six commits, each adding one isolation feature. Brilliant pacing.
Lizzie Dixon's "Linux containers in 500 lines of code" — C, more thorough. Includes overlay filesystems and seccomp.

For this degree: Liz Rice's Go path first (1 weekend), then Dixon's C version if you want depth.

The destination is a runtime that meets a small subset of the OCI Runtime Specification — the actual industry standard.

7. Implementation milestones (following Liz Rice's structure)

Milestone 1: `fork + exec` (no isolation)

A program that runs another program. This is your shell, basically.

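package main

import (
    "os"
    "os/exec"
)

func main() {
    switch os.Args[1] {
    case "run":
        run()
    case "child":
        child()
    }
}

// run re-executes this binary with "child" as its first argument.
func run() {
    cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    must(cmd.Run())
}

// child runs the requested program; later milestones add setup code here.
func child() {
    cmd := exec.Command(os.Args[2], os.Args[3:]...)
    cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
    must(cmd.Run())
}

// must panics on any error, which keeps the sketch short.
func must(err error) {
    if err != nil {
        panic(err)
    }
}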

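Re-executing /proc/self/exe with a child argument is the standard Go pattern. Go's multithreaded runtime cannot safely run Go code between fork() and exec(), so any setup that must happen inside the new namespaces (setting the hostname, mounting /proc) goes into a second invocation of the same binary.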

Evidence: mycontainer run /bin/bash opens a shell. No isolation yet.

Milestone 2: UTS namespace (hostname isolation)

Set CLONE_NEWUTS so the container has its own hostname.

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS,
}

In the child: syscall.Sethostname([]byte("container")).

Evidence: mycontainer run /bin/bash; hostname shows container, but the host's hostname is unchanged.
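
Put together, Milestone 2 might look like the sketch below. It assumes the run/child structure from Milestone 1; the function names newRunCmd and child and the choice to return errors instead of calling must are this example's own, not taken from Liz Rice's repository.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// newRunCmd builds the re-exec command for the "run" step. The child starts
// in a fresh UTS namespace, so a hostname change there is invisible to the host.
func newRunCmd(args []string) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, args...)...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWUTS}
	return cmd
}

// child runs inside the new namespace: set the hostname, then run the target.
func child(args []string) error {
	if err := syscall.Sethostname([]byte("container")); err != nil {
		return fmt.Errorf("sethostname: %w", err)
	}
	cmd := exec.Command(args[0], args[1:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: mycontainer run <cmd> [args...]")
		return
	}
	var err error
	switch os.Args[1] {
	case "run":
		err = newRunCmd(os.Args[2:]).Run()
	case "child":
		err = child(os.Args[2:])
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

Creating a UTS namespace needs root (or CAP_SYS_ADMIN), so run it as sudo ./mycontainer run /bin/bash.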

Milestone 3: PID namespace

Add CLONE_NEWPID. Inside the container, the first process is PID 1.

After this, ps inside the container still shows the host's processes because /proc is shared. Fix that in the next milestone.

Evidence: echo $$ inside the container shows a small number. The child process is PID 1, and the shell it launches gets a low PID of its own.
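
From here on, each milestone ORs one more flag into Cloneflags. A small lookup table can keep that readable. This is only a sketch: the short names ("uts", "pid", and so on) and the helper cloneFlags are invented for illustration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// namespaceFlags maps short names (chosen for this sketch) to the clone(2)
// flag that creates the corresponding namespace.
var namespaceFlags = map[string]uintptr{
	"uts":  syscall.CLONE_NEWUTS,
	"pid":  syscall.CLONE_NEWPID,
	"mnt":  syscall.CLONE_NEWNS,
	"ipc":  syscall.CLONE_NEWIPC,
	"net":  syscall.CLONE_NEWNET,
	"user": syscall.CLONE_NEWUSER,
}

// cloneFlags ORs together the flags for the requested namespaces.
func cloneFlags(names ...string) (uintptr, error) {
	var flags uintptr
	for _, n := range names {
		f, ok := namespaceFlags[n]
		if !ok {
			return 0, fmt.Errorf("unknown namespace %q", n)
		}
		flags |= f
	}
	return flags, nil
}

func main() {
	flags, err := cloneFlags("uts", "pid")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Assign the result to SysProcAttr.Cloneflags on the re-exec command.
	fmt.Printf("Cloneflags: %#x\n", flags)
}
```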

Milestone 4: Mount namespace + chroot

Add CLONE_NEWNS. Mount a private root filesystem (an extracted tarball, e.g., alpine-minirootfs.tar.gz from alpinelinux.org). chroot (or better, pivot_root) into it. Mount /proc so ps works.

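// Make mounts private first: pivot_root fails if the parent mount is shared (the systemd default).
must(syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""))
// pivot_root needs new_root to be a mount point, so bind-mount rootfs onto itself.
must(syscall.Mount("rootfs", "rootfs", "", syscall.MS_BIND|syscall.MS_REC, ""))
must(os.MkdirAll("rootfs/oldrootfs", 0700)) // put_old must exist under new_root
must(syscall.PivotRoot("rootfs", "rootfs/oldrootfs"))
must(os.Chdir("/"))
must(syscall.Unmount("/oldrootfs", syscall.MNT_DETACH))
must(syscall.Mount("proc", "/proc", "proc", 0, ""))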

Evidence: Inside the container, ls / shows Alpine's filesystem, not your host's. ps shows only container processes.

Milestone 5: cgroups (memory limit)

Create a cgroup, write the limit, add the child PID to it.

mkdir /sys/fs/cgroup/memory/mycontainer
echo 100M > /sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes
echo $PID > /sys/fs/cgroup/memory/mycontainer/tasks

In Go (cgroup v1):

must(os.MkdirAll("/sys/fs/cgroup/memory/mycontainer", 0755))
must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/memory.limit_in_bytes", []byte("100M"), 0644))
must(os.WriteFile("/sys/fs/cgroup/memory/mycontainer/tasks", []byte(strconv.Itoa(os.Getpid())), 0644))

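On modern systems use cgroup v2 (unified hierarchy): there is a single tree under /sys/fs/cgroup, the limit goes in memory.max, and PIDs go in cgroup.procs.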
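
A cgroup v2 version of the same steps could look like the sketch below. It assumes cgroup v2 is mounted at /sys/fs/cgroup and that the memory controller is enabled for child groups (via the parent's cgroup.subtree_control); the helper names and the group name mycontainer are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

const cgroupRoot = "/sys/fs/cgroup" // usual cgroup v2 mount point (assumption)

// v2Writes returns the (path, content) pairs to write, in order, to cap the
// memory of pid at limitBytes inside the named group.
func v2Writes(group string, limitBytes int64, pid int) [][2]string {
	dir := filepath.Join(cgroupRoot, group)
	return [][2]string{
		{filepath.Join(dir, "memory.max"), strconv.FormatInt(limitBytes, 10)},
		{filepath.Join(dir, "cgroup.procs"), strconv.Itoa(pid)},
	}
}

// applyV2 creates the group directory and performs the writes.
func applyV2(group string, limitBytes int64, pid int) error {
	if err := os.MkdirAll(filepath.Join(cgroupRoot, group), 0755); err != nil {
		return err
	}
	for _, w := range v2Writes(group, limitBytes, pid) {
		if err := os.WriteFile(w[0], []byte(w[1]), 0644); err != nil {
			return fmt.Errorf("write %s: %w", w[0], err)
		}
	}
	return nil
}

func main() {
	// 100 MiB limit for the calling process; its children inherit the group.
	if err := applyV2("mycontainer", 100<<20, os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, "cgroup setup failed (needs root and cgroup v2):", err)
	}
}
```

Calling applyV2 from the child step, before it starts the target program, puts everything the container runs under the limit.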

Evidence: Run a program that tries to allocate 200 MB. OOM killer kills it.
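
A test program for this can be very small. The sketch below is one possible memory hog; the name hog and the command-line shape are this example's own. It touches every page so the kernel has to back the allocation with real memory.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const mib = 1 << 20

// hog allocates n MiB in 1 MiB chunks and writes one byte per 4 KiB page,
// so the memory is actually resident rather than merely reserved.
func hog(n int) [][]byte {
	chunks := make([][]byte, 0, n)
	for i := 0; i < n; i++ {
		c := make([]byte, mib)
		for off := 0; off < len(c); off += 4096 {
			c[off] = 1
		}
		chunks = append(chunks, c)
	}
	return chunks
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: hog <MiB>")
		return
	}
	n, err := strconv.Atoi(os.Args[1])
	if err != nil || n < 0 {
		fmt.Fprintln(os.Stderr, "MiB must be a non-negative integer")
		os.Exit(1)
	}
	chunks := hog(n)
	fmt.Printf("allocated %d MiB\n", len(chunks))
}
```

Build it statically (CGO_ENABLED=0 go build) so it runs inside the Alpine rootfs, then run hog 200 under the 100 MB limit. If the host has swap enabled, the kernel may swap instead of killing the process, so the result is clearest with swap off.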

Milestone 6: Network namespace

Add CLONE_NEWNET. The container gets its own network stack, which starts with only a loopback interface, and that interface is down. Add a veth pair if you want it to talk to the outside.

ip link add veth0 type veth peer name veth1
ip link set veth1 netns $PID    # move one end into the container's network namespace
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
nsenter -t $PID -n ip addr add 10.0.0.2/24 dev veth1
nsenter -t $PID -n ip link set veth1 up
nsenter -t $PID -n ip link set lo up

Evidence: Container has its own loopback only by default. After veth setup, it can ping the host.
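
If the runtime itself should do the wiring, the parent can issue the same commands once it knows the child's PID. The sketch below only builds and prints the argument lists; vethCommands and inNS are hypothetical helpers, and the addresses and interface names are the ones used above. It assumes the ip and nsenter tools are installed on the host.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// inNS prefixes a command so it runs inside the network namespace of pid.
func inNS(pid string, args ...string) []string {
	return append([]string{"nsenter", "-t", pid, "-n"}, args...)
}

// vethCommands returns, in order, the commands that connect the network
// namespace of pid to the host through a veth pair.
func vethCommands(pid int) [][]string {
	p := strconv.Itoa(pid)
	return [][]string{
		{"ip", "link", "add", "veth0", "type", "veth", "peer", "name", "veth1"},
		{"ip", "link", "set", "veth1", "netns", p},
		{"ip", "addr", "add", "10.0.0.1/24", "dev", "veth0"},
		{"ip", "link", "set", "veth0", "up"},
		inNS(p, "ip", "addr", "add", "10.0.0.2/24", "dev", "veth1"),
		inNS(p, "ip", "link", "set", "veth1", "up"),
		inNS(p, "ip", "link", "set", "lo", "up"),
	}
}

func main() {
	// Dry run with a placeholder PID: print what would be executed.
	for _, c := range vethCommands(12345) {
		fmt.Println(strings.Join(c, " "))
	}
}
```

To execute them for real, pass each slice to exec.Command(c[0], c[1:]...) as root, after cmd.Start() in the parent has given you cmd.Process.Pid.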

Milestone 7 (optional): Overlay filesystem (image layers)

Use overlayfs to stack a writable layer on top of a read-only base layer. This is how Docker images work.

mkdir lower upper work merged
mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged

Evidence: Multiple containers share the same base image (lower), but each has its own writes (upper).
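
The same mount can be issued from Go with syscall.Mount. The sketch below assumes the directory layout from the shell example; overlayOptions and mountOverlay are names made up here. When several lower layers are given, they are separated by colons and the leftmost one is the top of the stack.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

// overlayOptions builds the option string for an overlay mount.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

// mountOverlay creates the writable directories if needed and mounts the
// merged view at merged. The lower directories must already exist.
func mountOverlay(lowers []string, upper, work, merged string) error {
	for _, d := range []string{upper, work, merged} {
		if err := os.MkdirAll(d, 0755); err != nil {
			return err
		}
	}
	return syscall.Mount("overlay", merged, "overlay", 0, overlayOptions(lowers, upper, work))
}

func main() {
	if len(os.Args) != 5 {
		fmt.Fprintln(os.Stderr, "usage: overlay <lower[:lower...]> <upper> <work> <merged>")
		return
	}
	lowers := strings.Split(os.Args[1], ":")
	if err := mountOverlay(lowers, os.Args[2], os.Args[3], os.Args[4]); err != nil {
		fmt.Fprintln(os.Stderr, "overlay mount failed (needs root):", err)
		os.Exit(1)
	}
}
```

Giving each container its own upper and work directory while reusing the same lower directory is what lets them share one base image.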

Milestone 8 (optional): seccomp filter

Restrict system calls. Block dangerous ones (reboot, kexec_load, mount, ...).

Milestone 9 (optional): User namespace

Map root inside the container to an unprivileged user outside. This is what "rootless containers" use.
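
In Go the mapping is expressed through SysProcAttr. The following is a minimal sketch, assuming the host allows unprivileged user namespaces (some distributions restrict them); rootlessAttr is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// rootlessAttr maps UID and GID 0 inside the new user namespace to the given
// unprivileged IDs on the host.
func rootlessAttr(hostUID, hostGID int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags:  syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: hostUID, Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: hostGID, Size: 1}},
	}
}

func main() {
	// Run "id" in a new user namespace as the current, unprivileged user.
	cmd := exec.Command("id")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	cmd.SysProcAttr = rootlessAttr(os.Getuid(), os.Getgid())
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "user namespace unavailable:", err)
	}
}
```

Run without sudo, id should report uid 0 inside the namespace, while ps on the host still shows the process under your own user.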

8. Tests & evidence

Test	How
Hostname isolation	Container can change hostname; host unchanged
PID isolation	PID 1 inside; host PIDs invisible
Filesystem isolation	`ls /` shows different content inside vs outside
Memory limit	OOM kill triggers on attempted over-allocation
Network isolation	Container starts with only loopback; veth gives it access
Process lifecycle	Container exit cleans up cgroups and namespaces
Interop	A Docker image's rootfs can be used as your container's filesystem

The strongest single evidence: a side-by-side terminal recording showing the container is isolated from the host.

9. Common pitfalls

Re-exec required from Go. Go can pass namespace flags through SysProcAttr.Cloneflags, but its multithreaded runtime cannot run Go code between fork and exec. The re-exec trick (call yourself with a different argument) gives you a place to run setup code inside the new namespaces.
pivot_root is finicky. Order matters. Read man 2 pivot_root carefully.
/proc confusion. Inside the container, /proc must be a fresh mount, not the host's.
cgroup v1 vs v2. Different APIs and paths. Modern systems (RHEL 9, Ubuntu 22+) default to v2.
Capabilities. A container running as "root" inside still has capabilities from the host's perspective unless you drop them.
Mounts leaking back to the host. Use MS_PRIVATE propagation to prevent your container's mounts from polluting the host's mount table.
Forgetting to unshare user namespace. Without it, "root inside" really is root, with all its danger.
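PID 1 special behavior. The kernel does not apply default signal actions to PID 1 of a PID namespace: a signal sent from inside the namespace is ignored unless the process has installed a handler for it. If your init process doesn't handle SIGTERM, the container may hang on shutdown. PID 1 also inherits orphaned processes and must wait() on them, or they linger as zombies.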
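
For the PID 1 pitfall, one common approach is to have the child step act as a tiny init: install handlers, forward signals to the real workload, and wait for it. The sketch below shows only the forwarding part under those assumptions; runAsInit is a name invented here, and reaping orphaned grandchildren is left out.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runAsInit starts the command, forwards SIGTERM and SIGINT to it, and
// returns its exit code once it finishes.
func runAsInit(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		return -1, err
	}

	sigs := make(chan os.Signal, 8)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	defer signal.Stop(sigs)

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	for {
		select {
		case s := <-sigs:
			_ = cmd.Process.Signal(s) // the workload may already have exited
		case err := <-done:
			var exitErr *exec.ExitError
			if errors.As(err, &exitErr) {
				return exitErr.ExitCode(), nil
			}
			if err != nil {
				return -1, err
			}
			return 0, nil
		}
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: init <cmd> [args...]")
		return
	}
	code, err := runAsInit(os.Args[1], os.Args[2:]...)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	os.Exit(code)
}
```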

10. Extensions

Container image format. Read a Docker image tarball (docker save) and use it as your container's filesystem.
OCI spec compliance. Implement enough of the OCI Runtime Specification that your runtime accepts the same config.json bundle that runc and crun run.
Networking via CNI. The Container Network Interface plugin model.
Rootless mode. Drop the need for sudo. User namespaces are the trick.
Layered storage. Multiple overlay layers; image building.
Image building. A toy version of docker build from a Dockerfile.

11. Module integration

Module	What the container deepens
Sem 5 Module 1 — Processes & scheduling	`clone()` is `fork()` with flags. Each namespace is a "child" of the global resource.
Sem 5 Module 3 — Concurrency	Namespaces are concurrent views of shared kernel state.
Sem 5 Module 4 — File systems & I/O	Overlay, bind mounts, chroot.
Shell tutorial	Container = shell + isolation flags.
Sem 9 (Production phase) — Cloud / Kubernetes	Kubernetes orchestrates these primitives at scale. Knowing the foundation makes K8s much less mysterious.

12. Portfolio framing

What to publish:

Source (main.go, cmd_run.go, cmd_child.go, cgroups.go, network.go).
A README with a side-by-side terminal recording showing isolation.
A list of "what Docker has that this doesn't": layers, image registry, networking plugins, security policies.
A list of OCI features you implemented vs skipped.

What to keep private:

None — this is portfolio-grade. But be honest about security: a toy container runtime is not a security boundary. State this loudly.

Reviewer entry points:

cmd_run.go — the entry point. The Cloneflags line is the heart of it.
cgroups.go — resource limits.
README must include: a video/GIF of isolation demonstration; security caveats; reference to Liz Rice or Dixon as your starting point.

A 200-line container runtime is a striking portfolio piece because everyone uses Docker and almost no one understands it.

Source

This tutorial draws from the BYO-X catalog "Docker" section. Liz Rice's GopherCon talk and Lizzie Dixon's 500-line C version are the canonical primary tutorials.
