OCI Images, Layers, and the Runtime

What This Concept Is

An OCI image is not a disk image. It is a content-addressed bundle of three things:

  • an image manifest: JSON listing the layers and config, each referenced by SHA-256 digest
  • a config blob: JSON holding the entrypoint, environment, working directory, labels, and the ordered list of diff IDs for the layers
  • one or more layer blobs: tarballs (usually gzip-compressed), each a filesystem changeset applied on top of the previous layer

Everything is named by the cryptographic digest of its bytes. Because the name is derived from the content, identical layers are stored once and shared between images on a host, and every blob can be checked against its digest after download.
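To make the naming rule concrete, here is a minimal sketch of a content-addressed blob store. BlobStore and oci_digest are hypothetical names chosen for illustration; a real registry or runtime stores blobs on disk and handles far more cases, but the keying idea is the same.

```python
import hashlib


def oci_digest(blob: bytes) -> str:
    """Return the digest string an OCI descriptor would carry for these bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


class BlobStore:
    """Toy content-addressed store (illustrative only): blobs are keyed by digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        digest = oci_digest(blob)
        # Identical bytes map to the same key, so a shared layer is stored once.
        self._blobs.setdefault(digest, blob)
        return digest

    def get(self, digest: str) -> bytes:
        blob = self._blobs[digest]
        # Re-hash on read: corrupted content cannot pass as the named blob.
        if oci_digest(blob) != digest:
            raise ValueError(f"digest mismatch for {digest}")
        return blob

    def __len__(self) -> int:
        return len(self._blobs)
```

Putting the same bytes twice returns the same digest and leaves one stored copy, which is the sharing behavior described above.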

The OCI Runtime Specification is the separate contract a container runtime has to fulfill once an image has been unpacked: given a root filesystem and a config.json, launch a process inside a set of namespaces and cgroups. runc is the reference implementation.

Why It Matters Here

Kubernetes does not pull an image itself; it tells a runtime (via the Container Runtime Interface) to pull it, unpack the layers into a stacked filesystem, and start a process according to the runtime spec. Misbehavior in this pipeline shows up as:

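  • ErrImagePull -- the pull failed: image or tag not found, registry unreachable, credentials wrong, or a blob failed digest verification
  • ImagePullBackOff -- the kubelet is waiting, with increasing delay, before retrying a pull that already failed with ErrImagePull
  • CreateContainerConfigError -- the kubelet could not build the container's configuration, most often because a referenced ConfigMap or Secret is missing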
  • exceptionally slow pod startup -- large layers or no layer cache on the node

If you think an image is "just a tarball of a filesystem," you will not debug these correctly.

Concrete Example

A minimal image manifest (trimmed):

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:6e9f07f...",
    "size": 1471
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:2db29710...",
      "size": 2811478
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8d3ac3489...",
      "size": 2126715
    }
  ]
}

The runtime unpacks these layers in order and stacks them with a union filesystem (typically overlayfs). Each layer is a read-only lower directory; the container writes into a new upper directory. The union appears to the container as one filesystem, but only the upper is mutable. That is why changes to / in a running container do not modify the image -- and why "committing a container to an image" amounts to saving the upper directory as a new layer.
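The lookup rule can be sketched with plain dictionaries standing in for directories. This is a toy model, not how overlayfs is implemented: merged_view and the WHITEOUT marker are hypothetical names, and the marker only loosely mirrors the whiteout entries that OCI layers use to record deletions.

```python
WHITEOUT = None  # hypothetical marker meaning "this layer deletes the path"


def merged_view(lowers, upper):
    """Return the filesystem a container would see.

    lowers: read-only image layers, in manifest order (first = bottom).
    upper:  the container's private writable layer.
    """
    view = {}
    for layer in [*lowers, upper]:  # later layers shadow earlier ones
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)  # deletion recorded in a higher layer
            else:
                view[path] = content
    return view


base = {"/etc/os-release": "alpine", "/bin/sh": "busybox"}
app = {"/app/main.py": "v1", "/bin/sh": WHITEOUT}
upper = {}  # starts empty for every new container

upper["/app/main.py"] = "patched"  # a write inside the container
view = merged_view([base, app], upper)
```

The container sees the patched file, while the app layer still holds "v1": the write landed only in the upper layer, so the image is unchanged.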

Common Confusion / Misconception

"Each RUN in a Dockerfile creates a layer, and you want as few layers as possible."

Each RUN, COPY, and ADD does create a layer, so the first half is roughly right. The second half is the misconception: the goal is not "few layers" but layers that are small, cacheable, and ordered from least to most frequently changing. A good Dockerfile puts slowly-changing dependencies before the application code, because the builder can reuse a cached layer only if that step's inputs and every earlier step are unchanged. Squashing down to one layer often makes rebuilds slower, not faster, because any change then invalidates the whole layer.

A second confusion: "latest is a version." It is not. It is a mutable tag. Two pulls of the same :latest on different days can give you different digests. Production manifests should pin by digest (image: nginx@sha256:abc...) when reproducibility matters.
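A small check along these lines can flag unpinned images before a manifest ships. It is a sketch under assumptions: is_digest_pinned and unpinned_images are hypothetical helpers, the pod is a plain dict shaped like a Pod manifest, and only sha256 digests are recognized.

```python
import re

# A reference counts as pinned only if it ends in a full sha256 digest.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True for 'repo@sha256:<64 hex>' (a tag may also be present before the '@')."""
    return _DIGEST_SUFFIX.search(image) is not None


def unpinned_images(pod: dict) -> list[str]:
    """List image references in a Pod manifest that rely on a mutable tag."""
    spec = pod.get("spec", {})
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    return [c["image"] for c in containers if not is_digest_pinned(c["image"])]
```

A reference with no tag at all, such as nginx, is reported too, since it resolves to :latest.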

How To Use It

When you read a Kubernetes pod spec, trace the chain:

  1. The image field resolves to a manifest by tag or digest at a registry.
  2. The kubelet asks the runtime, via CRI, to pull missing blobs.
  3. The runtime verifies digests, unpacks layers, stacks them via overlayfs.
  4. The runtime assembles an OCI config.json from the image config plus Pod spec (env, command, mounts, securityContext).
  5. runc or its substitute creates namespaces and cgroups and execs the entrypoint.
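Step 4 is where image defaults and Pod overrides meet. The sketch below covers only the process section (args, env, cwd) and follows the Kubernetes rule that command replaces the image ENTRYPOINT and args replaces the image CMD. build_process is a hypothetical helper; a real runtime also fills in mounts, namespaces, cgroups, and security settings.

```python
def build_process(image_config: dict, container: dict) -> dict:
    """Merge an OCI image config with a Pod container spec (illustrative only)."""
    cfg = image_config.get("config", {})
    command = container.get("command")
    args = container.get("args")

    if command is not None:
        # command replaces ENTRYPOINT; the image CMD is ignored in this case.
        argv = list(command) + list(args or [])
    else:
        # Keep ENTRYPOINT; args, when given, replaces the image CMD.
        tail = args if args is not None else (cfg.get("Cmd") or [])
        argv = list(cfg.get("Entrypoint") or []) + list(tail)

    # Image Env is a list of "KEY=value" strings; Pod env wins on name clashes.
    env = dict(item.split("=", 1) for item in cfg.get("Env") or [])
    for var in container.get("env", []):
        env[var["name"]] = var.get("value", "")

    return {
        "args": argv,
        "env": [f"{key}={value}" for key, value in env.items()],
        "cwd": container.get("workingDir") or cfg.get("WorkingDir") or "/",
    }
```

For an image with ENTRYPOINT ["python"] and CMD ["app.py"], a container spec with args: ["worker.py"] yields python worker.py, while one with command: ["sh", "-c", "sleep 1"] drops both image defaults.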

Layer Ordering in Practice

A good Dockerfile orders layers by churn -- slowest-changing things first, fastest-changing last. For a typical Python service:

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]

The cache reuses a layer only when its own inputs and those of every prior step hash the same. Because requirements.txt rarely changes, the pip install layer can be reused across dozens of builds. If you flipped the order -- COPY . . before RUN pip install -- every source change would invalidate the install layer, and each CI build would reinstall all dependencies from scratch.
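The chaining behavior can be modeled in a few lines. This is a simplified model, not the actual key derivation of Docker or BuildKit: cache_keys and reused_steps are hypothetical helpers, and each step's key is taken to be a hash of its parent key, its instruction text, and the bytes it reads.

```python
import hashlib


def cache_keys(steps, base="python:3.12-slim"):
    """Derive one key per step; each key folds in the key of the step before it."""
    keys = []
    parent = base
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        h.update(inputs)
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reused_steps(old, new):
    """Count leading steps whose keys match, i.e. steps served from cache."""
    count = 0
    for a, b in zip(old, new):
        if a != b:
            break
        count += 1
    return count


reqs, src_v1, src_v2 = b"flask==3.0", b"print('v1')", b"print('v2')"


def good(src):  # dependencies first, source last
    return [("COPY requirements.txt .", reqs), ("RUN pip install", b""), ("COPY . .", src)]


def bad(src):  # source first: every edit changes the first key
    return [("COPY . .", src), ("RUN pip install", b"")]


good_hits = reused_steps(cache_keys(good(src_v1)), cache_keys(good(src_v2)))
bad_hits = reused_steps(cache_keys(bad(src_v1)), cache_keys(bad(src_v2)))
```

After a source-only change, the dependency-first ordering keeps its first two steps cached, while the flipped ordering keeps none.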

Similarly, multi-stage builds exist to keep build tooling out of the runtime image:

FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app

FROM gcr.io/distroless/base-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]

The final image has no compiler, no shell, and no package manager, so it is smaller, faster to pull, and exposes a much smaller attack surface.

Image Index, Platforms, and Supply Chain

A tag in a modern registry often resolves first to an image index (a.k.a. manifest list), not a single image manifest. The index has one entry per platform (linux/amd64, linux/arm64, windows/amd64, ...). The client picks the matching platform and pulls that entry.

This is how nginx:1.27 can transparently work on both an Intel server and an ARM laptop with the same tag -- the two are distinct images under one index.
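Platform selection amounts to a lookup over the index's manifests array. The sketch below assumes a trimmed index with placeholder digests; select_manifest is a hypothetical helper, and real clients also compare fields such as variant (for example arm/v7).

```python
def select_manifest(index: dict, os_name: str, arch: str) -> str:
    """Return the digest of the image manifest matching the requested platform."""
    for entry in index.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}")


# Trimmed image index; the digests are placeholders, not real values.
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:aaa...", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbb...", "platform": {"os": "linux", "architecture": "arm64"}},
    ],
}

arm_digest = select_manifest(index, "linux", "arm64")
```

The tag stays the same for every client; only the digest chosen from the index differs by platform.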

Supply-chain tooling attaches additional artifacts to an image by digest:

  • SBOMs (CycloneDX, SPDX) listing every package and version baked in
  • Signatures (cosign, notation) proving who built and pushed the image
  • Attestations (in-toto, SLSA) describing the build provenance

All of these are separate OCI artifacts that refer to the image by digest -- there is no :signed tag. Clusters typically enforce policy with an admission controller that verifies signatures against a keyring or a Sigstore root before admitting a Pod.
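As one concrete case, cosign's tag-based layout (its long-standing default; newer releases can also attach signatures through the OCI referrers mechanism) stores the signature in the same repository under a tag derived from the image digest. The helper below is a hypothetical sketch of that naming rule only, and the repository name in the usage note is made up.

```python
def cosign_signature_ref(repository: str, digest: str) -> str:
    """Derive the tag-based location where cosign stores a signature for an image.

    repository: image repository without tag or digest.
    digest:     the image digest, e.g. 'sha256:<hex>'.
    """
    algo, _, hex_part = digest.partition(":")
    if algo != "sha256" or not hex_part:
        raise ValueError(f"unsupported digest: {digest!r}")
    # The ':' in the digest becomes '-' and '.sig' is appended.
    return f"{repository}:{algo}-{hex_part}.sig"
```

For example, an image at registry.example.com/team/app with digest sha256:<hex> would have its signature at registry.example.com/team/app:sha256-<hex>.sig. A policy engine resolves the Pod's image to a digest first, then looks up and verifies the artifact found there.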

Check Yourself

  1. What exactly does a layer contain, and how are layers combined at runtime?
  2. Why does latest fail reproducibility even though it is a valid tag?
  3. What part of the OCI runtime spec does runc implement, and what part comes from the image?

Mini Drill or Application

Pull a small image (e.g. alpine:3.19) and inspect it:

docker pull alpine:3.19
docker image inspect alpine:3.19
docker save alpine:3.19 -o alpine.tar
mkdir alpine && tar -xf alpine.tar -C alpine

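List the files in the extracted bundle. Identify the manifest, the config, and each layer's tarball. The exact layout depends on the Docker version: recent releases write an OCI-style layout (index.json plus blobs/sha256/), older ones write one directory per layer, and both include a manifest.json. Then write a one-paragraph explanation of what the runtime will do with each piece on docker run.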
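If you would rather inspect the archive programmatically, the sketch below reads manifest.json from the docker save output. It assumes the archive contains a top-level manifest.json whose entries carry Config, RepoTags, and Layers keys; describe_saved_image is a hypothetical helper name.

```python
import io
import json
import tarfile


def describe_saved_image(archive: bytes) -> list[dict]:
    """Summarize a `docker save` archive: tags, config path, and layer paths."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        member = tar.extractfile("manifest.json")
        if member is None:
            raise ValueError("manifest.json is not a regular file")
        entries = json.load(member)
    return [
        {
            "tags": entry.get("RepoTags") or [],
            "config": entry["Config"],   # path of the config blob inside the tar
            "layers": entry["Layers"],   # layer paths, bottom layer first
        }
        for entry in entries
    ]
```

Usage would look like describe_saved_image(open("alpine.tar", "rb").read()); reading the whole file into memory is fine for an image as small as alpine.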

Check Yourself (extended)

  1. What is the practical difference between an image index and an image manifest, and when do you see one instead of the other?
  2. A Dockerfile has COPY . . as step 2 and RUN pip install as step 10. What is wrong with the ordering from a caching perspective?
  3. An image signed with cosign does not have a :signed tag -- where does the signature live, and how does a cluster policy find it?

Content-Addressing: Same Idea as Git

OCI's content-addressing is essentially the same model Git uses for blobs, trees, and commits: every object is named by a cryptographic hash of its content (SHA-256 in OCI; SHA-1 by default in Git, with SHA-256 available as an opt-in object format), and higher-level objects reference lower ones by that hash. That parallel is not a coincidence -- both systems solve verifiable, deduplicated storage of immutable trees.

Practical consequences that carry over from one world to the other:

  • Two clients pushing the "same" layer cause it to be stored once; the registry deduplicates by digest, much as Git's object database holds identical blobs once.
  • A pulled image whose layer digest does not match what the manifest advertises is rejected, much as git fsck reports a corrupted object.
  • Tags (like Git branches) are mutable pointers at digests (like Git commits). A digest pin (image@sha256:…) is the container equivalent of git checkout <sha>.

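If you already understand Git internals, OCI's vocabulary will feel familiar: manifests and configs play roughly the role of commits, layers the role of trees and blobs, tags the role of refs, and registries the role of remotes. The mapping is loose rather than one-to-one; an image index, for instance, has no exact Git counterpart.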
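One detail where the two systems differ is how the name is computed. The sketch below contrasts them under default settings: Git hashes a short header plus the content with SHA-1, while an OCI digest is the SHA-256 of the raw bytes. The function names are illustrative.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob under its default SHA-1 object format."""
    header = b"blob %d\x00" % len(content)  # type, size, NUL separator
    return hashlib.sha1(header + content).hexdigest()


def oci_digest(content: bytes) -> str:
    """Digest an OCI descriptor carries: SHA-256 over the raw bytes, no header."""
    return "sha256:" + hashlib.sha256(content).hexdigest()
```

You can compare git_blob_id(b"hello\n") with the output of git hash-object on a file containing the same bytes. The names differ between the two systems, but the property is shared: the same content always gets the same name.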

Read This Only If Stuck