OCI Images, Layers, and the Runtime
What This Concept Is
An OCI image is not a disk image. It is a content-addressed bundle of three things:
- an image manifest: JSON listing the layers and config, each referenced by SHA-256 digest
- a config blob: JSON holding the entrypoint, environment, working directory, labels, and the ordered list of diff IDs for the layers
- one or more layer blobs: tar archives (usually gzip-compressed), each a filesystem diff relative to the previous layer
Everything is named by the cryptographic digest of its bytes. This means identical layers are shared between images on a host and verified end to end.
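As a minimal sketch of what "named by the digest of its bytes" means in practice, the Python below computes an OCI-style digest string and checks a blob against it. The names digest_of and verify are hypothetical helpers for illustration, not part of any OCI tooling.

```python
import hashlib


def digest_of(blob: bytes) -> str:
    """Return the OCI-style digest string ("sha256:<hex>") for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify(blob: bytes, expected: str) -> bool:
    """True only if the blob's bytes hash to the digest we were told to expect."""
    return digest_of(blob) == expected


layer = b"pretend this is a layer tarball"
name = digest_of(layer)

print(verify(layer, name))          # same bytes, same name
print(verify(layer + b"!", name))   # one extra byte changes the digest
```

Because the name is derived from the content, two images that contain byte-identical layers necessarily refer to the same blob, which is what makes sharing and verification fall out of the same mechanism.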
The OCI Runtime Specification is the separate contract a container runtime has to fulfill once an image has been unpacked: given a root filesystem and a config.json, launch a process inside a set of namespaces and cgroups. runc is the reference implementation.
Why It Matters Here
Kubernetes does not pull an image itself; it tells a runtime (via the Container Runtime Interface) to pull it, unpack the layers into a stacked filesystem, and start a process according to the runtime spec. Misbehavior in this pipeline shows up as:
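- ErrImagePull -- the pull attempt failed: manifest unreachable, credentials wrong, or a blob that fails digest verification
- ImagePullBackOff -- the kubelet is waiting between retries after repeated ErrImagePull failures
- CreateContainerConfigError -- the kubelet cannot assemble the container's configuration, most often because a referenced ConfigMap or Secret is missing
- exceptionally slow pod startup -- large layers or no layer cache on the node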
If you think an image is "just a tarball of a filesystem," you will not debug these correctly.
Concrete Example
A minimal image manifest (trimmed):
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:6e9f07f...",
    "size": 1471
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:2db29710...",
      "size": 2811478
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8d3ac3489...",
      "size": 2126715
    }
  ]
}
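To make the manifest's role concrete, here is a small sketch of how a client might decide which blobs still need fetching. The dict mirrors the shape of the manifest above, but its digests are made-up placeholders, and blobs_to_pull and the local digest set are hypothetical names; real runtimes track this in their own content stores.

```python
def blobs_to_pull(manifest: dict, local_digests: set) -> list:
    """Return the descriptors (config first, then layers) not already held locally."""
    wanted = [manifest["config"], *manifest["layers"]]
    return [d for d in wanted if d["digest"] not in local_digests]


# Placeholder digests, shortened for readability; real ones are 64 hex characters.
manifest = {
    "schemaVersion": 2,
    "config": {"digest": "sha256:cfg", "size": 1471},
    "layers": [
        {"digest": "sha256:base", "size": 2811478},
        {"digest": "sha256:app", "size": 2126715},
    ],
}

# Assumption: the base layer is already on the node because another image uses it.
missing = blobs_to_pull(manifest, {"sha256:base"})
bytes_to_fetch = sum(d["size"] for d in missing)
print([d["digest"] for d in missing], bytes_to_fetch)
```

The point of the sketch is that the pull is driven entirely by digests: a layer already present under the same digest is never downloaded again.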
The runtime unpacks these layers in order and stacks them with a union filesystem (typically overlayfs). Each layer becomes a read-only lower directory; the container writes into a fresh upper directory. The union appears to the container as one filesystem, but only the upper directory is mutable. That is why changes made under / in a running container do not modify the image -- and why "committing a container to an image" amounts to snapshotting the upper directory as a new layer.
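The stacking behaviour can be modelled in a few lines of Python. This is a toy model under stated assumptions: each layer is a dict from relative path to content, deletions use the ".wh." whiteout filename prefix from the OCI layer changeset format, and directory or opaque whiteouts are ignored. Real overlayfs does not flatten anything; it resolves lookups through the stack at access time. The names apply_layer and union are hypothetical.

```python
WHITEOUT_PREFIX = ".wh."


def apply_layer(view: dict, layer: dict) -> dict:
    """Return a new merged view with one layer's changes applied on top."""
    merged = dict(view)
    # Whiteouts first: they hide paths that exist in lower layers.
    for path in layer:
        parent, _, name = path.rpartition("/")
        if name.startswith(WHITEOUT_PREFIX):
            hidden = name[len(WHITEOUT_PREFIX):]
            merged.pop(f"{parent}/{hidden}" if parent else hidden, None)
    # Then additions and modifications: upper entries shadow lower ones.
    for path, content in layer.items():
        if not path.rpartition("/")[2].startswith(WHITEOUT_PREFIX):
            merged[path] = content
    return merged


def union(layers: list) -> dict:
    """Fold an ordered list of layers (lowest first) into one view."""
    view = {}
    for layer in layers:
        view = apply_layer(view, layer)
    return view


image_layers = [
    {"etc/motd": "hello", "bin/tool": "v1"},
    {"bin/tool": "v2", "etc/.wh.motd": ""},   # replaces bin/tool, deletes etc/motd
]
upper = {"tmp/scratch": "written at runtime"}  # the container's writable layer

rootfs = apply_layer(union(image_layers), upper)
print(sorted(rootfs))
```

Note that image_layers is never modified: the container's writes live only in upper, which is the property the paragraph above relies on.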
Common Confusion / Misconception
"Each RUN in a Dockerfile creates a layer, and you want as few layers as possible."
Each RUN, COPY, and ADD creates a layer. The constraint is not "few layers" but "small, cacheable, and ordered from least-to-most-frequently-changing." A good Dockerfile puts slowly-changing dependencies before the application code, because the builder reuses a cached layer only if every prior layer's inputs match. Squashing down to one layer often makes rebuilds slower, not faster, because it defeats caching.
A second confusion: "latest is a version." It is not. It is a mutable tag. Two pulls of the same :latest on different days can give you different digests. Production manifests should pin by digest (image: nginx@sha256:abc...) when reproducibility matters.
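A simple lint for this is to check whether an image reference ends in a full SHA-256 digest. The sketch below is a hedged illustration: is_digest_pinned is a hypothetical helper, it recognises only sha256 digests, and the all-"a" digest is a placeholder rather than a real image.

```python
import re

# A pinned reference ends in "@sha256:" followed by exactly 64 lowercase hex characters.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """True if the reference names immutable content rather than a mutable tag."""
    return DIGEST_SUFFIX.search(image_ref) is not None


placeholder = "a" * 64  # stands in for a real digest
for ref in ["nginx:latest", "nginx:1.27", f"nginx@sha256:{placeholder}"]:
    print(ref[:24], is_digest_pinned(ref))
```

A version tag such as nginx:1.27 is still reported as unpinned, because a tag of any name can be moved to a different digest.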
How To Use It
When you read a Kubernetes pod spec, trace the chain:
- The image field resolves to a manifest by tag or digest at a registry.
- The kubelet asks the runtime, via CRI, to pull missing blobs.
- The runtime verifies digests, unpacks layers, stacks them via overlayfs.
- The runtime assembles an OCI config.json from the image config plus Pod spec (env, command, mounts, securityContext).
- runc or its substitute creates namespaces and cgroups and execs the entrypoint.
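One part of assembling config.json that often surprises people is how the image's ENTRYPOINT and CMD combine with the Pod's command and args. The sketch below follows the override rules described in the Kubernetes documentation; effective_argv is a hypothetical helper, and a real runtime merges many more fields (env, mounts, securityContext) than this shows.

```python
def effective_argv(image_entrypoint, image_cmd, pod_command=None, pod_args=None):
    """Compute the argv a container starts with.

    - Pod command set: it replaces ENTRYPOINT, and the image CMD is ignored.
    - Only Pod args set: ENTRYPOINT is kept, args replace the image CMD.
    - Neither set: the image's ENTRYPOINT followed by its CMD.
    """
    if pod_command:
        return list(pod_command) + list(pod_args or [])
    if pod_args:
        return list(image_entrypoint or []) + list(pod_args)
    return list(image_entrypoint or []) + list(image_cmd or [])


entrypoint, cmd = ["/app"], ["--serve"]
print(effective_argv(entrypoint, cmd))
print(effective_argv(entrypoint, cmd, pod_args=["--debug"]))
print(effective_argv(entrypoint, cmd, pod_command=["sh", "-c"], pod_args=["env"]))
```

The case worth remembering is the first branch: setting command alone silently drops the image's CMD, which is a common cause of a container that starts with no arguments.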
Layer Ordering in Practice
A good Dockerfile orders layers by churn -- slowest-changing things first, fastest-changing last. For a typical Python service:
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
The cache reuses layers only when all prior inputs hash the same. Because requirements.txt rarely changes, pip install can be cached across dozens of builds. If you flipped the order -- COPY . . before RUN pip install -- every source change invalidates the install layer, and CI builds become minutes longer than they need to be.
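The effect of ordering can be shown with a toy model of the cache, in which each step's key is a hash of the previous key, the instruction text, and the bytes the step reads. This is an assumption-laden sketch, not how any particular builder computes its keys; cache_keys, reusable_steps, deps_first, and source_first are hypothetical names, and the file contents are sample data.

```python
import hashlib


def cache_keys(steps):
    """steps: list of (instruction, input_bytes). Returns one chained key per step."""
    keys, parent = [], ""
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode())       # every key depends on all earlier steps
        h.update(instruction.encode())
        h.update(inputs)
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reusable_steps(old_keys, new_keys):
    """Count leading steps whose keys are unchanged between two builds."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


def deps_first(reqs: bytes, src: bytes):
    return [("FROM python:3.12-slim", b""),
            ("COPY requirements.txt .", reqs),
            ("RUN pip install -r requirements.txt", b""),
            ("COPY . .", src)]


def source_first(reqs: bytes, src: bytes):
    return [("FROM python:3.12-slim", b""),
            ("COPY . .", reqs + src),
            ("RUN pip install -r requirements.txt", b"")]


reqs = b"flask==3.0.0\n"
for build in (deps_first, source_first):
    before = cache_keys(build(reqs, b"print('v1')"))
    after = cache_keys(build(reqs, b"print('v2')"))
    print(build.__name__, reusable_steps(before, after), "of", len(before), "steps reused")
```

With dependencies copied first, a source-only change leaves the install step's key untouched; with the source copied first, the install step's key changes even though requirements.txt did not.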
Similarly, multi-stage builds exist to keep build tooling out of the runtime image:
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app
FROM gcr.io/distroless/base-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
The final image has no compiler, no shell, no package manager -- smaller, faster to pull, and far less attack surface.
Image Index, Platforms, and Supply Chain
A tag in a modern registry often resolves first to an image index (a.k.a. manifest list), not a single image manifest. The index has one entry per platform (linux/amd64, linux/arm64, windows/amd64, ...). The client picks the matching platform and pulls that entry.
This is how nginx:1.27 can transparently work on both an Intel server and an ARM laptop with the same tag -- the two are distinct images under one index.
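Platform selection can be sketched as a lookup over the index's manifests array. The field names (manifests, platform, os, architecture, digest) follow the OCI image index format, but the digests are placeholders, select_manifest is a hypothetical helper, and real clients also consider fields such as variant that this sketch ignores.

```python
def select_manifest(index: dict, os_name: str, arch: str) -> str:
    """Return the digest of the index entry matching the given OS and architecture."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return entry["digest"]
    raise LookupError(f"no manifest for {os_name}/{arch}")


# Placeholder digests: one repeated character each, not real images.
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {"digest": "sha256:" + "a" * 64,
         "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:" + "b" * 64,
         "platform": {"os": "linux", "architecture": "arm64"}},
    ],
}

intel = select_manifest(index, "linux", "amd64")
arm = select_manifest(index, "linux", "arm64")
print(intel != arm)  # same tag, two different image manifests
```

This also explains why the digest you see for a tag can differ between machines: one tool may report the index digest, another the digest of the platform-specific manifest it resolved to.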
Supply-chain tooling attaches additional artifacts to an image by digest:
- SBOMs (CycloneDX, SPDX) listing every package and version baked in
- Signatures (cosign, notation) proving who built and pushed the image
- Attestations (in-toto, SLSA) describing the build provenance
All of these are separate OCI artifacts referenced by digest -- there is no :signed tag. Clusters enforce policy by verifying signatures against a keyring or a Sigstore root before admitting a Pod.
Check Yourself
- What exactly does a layer contain, and how are layers combined at runtime?
- Why does latest fail reproducibility even though it is a valid tag?
- What part of the OCI runtime spec does runc implement, and what part comes from the image?
Mini Drill or Application
Pull a small image (e.g. alpine:3.19) and inspect it:
docker pull alpine:3.19
docker image inspect alpine:3.19
docker save alpine:3.19 -o alpine.tar
mkdir alpine && tar -xf alpine.tar -C alpine
List the files in the extracted bundle. Identify the manifest, the config, and each layer's tarball. Then write a one-paragraph explanation of what the runtime will do with each piece on docker run.
Check Yourself (extended)
- What is the practical difference between an image index and an image manifest, and when do you see one instead of the other?
- A Dockerfile has COPY . . as step 2 and RUN pip install as step 10. What is wrong with the ordering from a caching perspective?
- An image signed with cosign does not have a :signed tag -- where does the signature live, and how does a cluster policy find it?
Content-Addressing: Same Idea as Git
OCI's content-addressing is essentially the same model Git uses for blobs, trees, and commits: every artifact is named by a cryptographic hash of its content (SHA-256 in OCI; SHA-1 by default in Git, with SHA-256 as an opt-in object format), and higher-level objects reference lower ones by digest. That parallel is not a coincidence -- both systems solve verifiable, deduplicated storage of immutable trees.
Practical consequences that carry over from one world to the other:
- Two clients pushing the "same" layer result in one stored copy; the registry deduplicates by digest, much as Git stores an identical blob only once no matter how many commits reference it.
- A pulled image whose layer digest does not match what the manifest advertises is rejected, the same way git fsck would flag a corrupted blob.
- Tags (like Git branches) are mutable pointers at digests (like Git commits). A digest pin (image@sha256:…) is the container equivalent of git checkout <sha>.
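These consequences can be reproduced with a tiny in-memory store. The sketch is a loose model of a registry (or a Git object database), not either one's real API: BlobStore and its methods are hypothetical names chosen for illustration.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store with mutable tags pointing at immutable digests."""

    def __init__(self):
        self.blobs = {}   # digest -> bytes
        self.tags = {}    # tag name -> digest

    def put(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # identical content is stored once
        return digest

    def tag(self, name: str, digest: str) -> None:
        if digest not in self.blobs:
            raise KeyError(digest)
        self.tags[name] = digest              # re-tagging silently moves the pointer

    def resolve(self, ref: str) -> str:
        """A digest resolves to itself; a tag resolves to whatever it points at now."""
        return ref if ref.startswith("sha256:") else self.tags[ref]


store = BlobStore()
first = store.put(b"layer contents v1")
again = store.put(b"layer contents v1")   # a second client pushing the same bytes
store.tag("latest", first)

second = store.put(b"layer contents v2")
store.tag("latest", second)               # the tag moves; the old digest still works

print(first == again, len(store.blobs))
print(store.resolve("latest") == second, store.resolve(first) == first)
```

The last line is the whole argument for digest pinning: the tag now answers with new content, while the digest still names exactly the bytes it always did.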
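If you already understand Git internals, OCI's vocabulary will feel familiar: manifests and indexes play roughly the role of commits and trees, layers and configs the role of blobs, and tags the role of refs, with registries acting as remotes. The mapping is loose rather than one-to-one.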
Read This Only If Stuck
- Pro Git: Git objects -- the same content-addressed object model OCI adopts.
- Pro Git: Tree objects -- why referencing children by digest makes the whole tree verifiable.
- Pro Git: Packfiles -- Git's deduplication of identical blobs mirrors a registry's deduplication of identical layers.
- Linux Command Line: Mounting and unmounting storage devices -- what overlayfs and bind mounts are layered on top of.
- OCI Image Format Specification -- normative definition of manifests, configs, and layer media types.
- OCI Runtime Specification -- what a runtime must do given a root filesystem and config.json.
- OCI Image Layer Filesystem Changeset -- tar format, whiteouts, and diff semantics inside a layer blob.
- Docker: Dockerfile best practices -- canonical layer-ordering and multi-stage guidance.
- Docker: Build cache -- the mental model for why COPY . . late is a cache hit and early is a miss.
- Sigstore / cosign: signing OCI artifacts -- how signatures and attestations attach to images by digest.