Build Your Own Git

"Git is a content-addressable filesystem with a version-control system grafted on top." — Linus Torvalds

Building a minimal Git clone is the single best way to understand the tool you already use every day. The data model is small (four object types), the algorithms are simple (SHA-1 hashing, zlib compression, set difference), and the result is a working clone of git init, git add, git commit, git log, and git status in under 500 lines.

1. Overview & motivation

Git's data model has exactly four object types:

Blob — file contents.
Tree — directory listing (names + types + hash references).
Commit — parent commit hash + tree hash + author + message.
Tag — annotated reference to another object.

Every object is identified by the SHA-1 (or SHA-256 in newer Git) of its contents. Objects are stored on disk by their hash. Branches and tags are just files in .git/refs/ that contain hashes.

That's it. The rest of Git — staging, merging, rebasing, remotes — is operations on these four object types.

What you can only learn by building one:

Why content-addressable storage is the right primitive for version control.
Why git status is much more complex than it looks (compare working directory, index, HEAD).
Why git rebase is just "recreate commits onto a different parent."
Why packfiles exist (object-per-file is slow; pack them up).
Why Git's object store is immutable: branches move, but objects never change.

2. Where this fits in the degree

Phase: Systems
Semester: 4 or 5 (Systems Programming / OS-Networking)
Modules deepened: Sem 4 Module 1 (C/Python/Ruby fundamentals), Sem 5 Module 4 (file systems & I/O — Git is a small filesystem on top of a filesystem).

Cross-phase relevance:

Connects to the Blockchain tutorial — both are content-addressable hash-chained structures.
Familiar territory for the Database (KV) tutorial — Git's object database is essentially a hash-keyed KV store on disk.

3. Prerequisites

A scripting language: Python or Ruby. (Or Haskell/Rust if you prefer the harder paths.)
SHA-1: just hashlib.sha1(...). No cryptography knowledge needed.
zlib: zlib.compress / zlib.decompress. Just APIs.
Some familiarity with using Git from the command line.

4. Theory & research

Required reading

Scott Chacon & Ben Straub, Pro Git, Chapter 10: "Git Internals" (git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain). Free online. The single canonical text on Git's data model. ⭐ read this first.
Ben Hoyt, "pygit: Just enough of a Git client to create a repo, commit, and push itself to GitHub" (benhoyt.com/writings/pygit/). Alternatively, Mary Rose Cook's Gitlet (maryrosecook.com / github.com/MaryRoseCook/gitlet).

Strongly recommended

Thibault Polge, "Write yourself a Git!" — wyag.thb.lt. The single most thorough Python tutorial. ⭐ recommended primary path.
Nikita Leshenko, "Git Internals: Plumbing and Porcelain" — shahriar.svbtle.com/becoming-a-git-master. Concise.

For depth

Git source code (github.com/git/git) — cache.h, object.c, commit.c. Production C.
Linus Torvalds' first Git commit message (April 2005) — read the comments in the very first version: github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290.

5. Curated tutorial list (from BYO-X)

Haskell: Reimplementing "git clone" in Haskell from the bottom up
JavaScript: Gitlet — MaryRoseCook/gitlet — 300 lines, includes diff and merge
JavaScript: Build GIT - Learn GIT
Python: Just enough of a Git client to create a repo, commit, and push itself to GitHub — Ben Hoyt's pygit — 500 lines, self-hosting
Python: Write yourself a Git! — Thibault Polge, wyag ⭐ recommended primary
Python: ugit: Learn Git Internals by Building Git Yourself — Nikita Leshenko, ugit.readthedocs.io — 11 incremental steps with tags
Ruby: Rebuilding Git in Ruby

6. Recommended primary path

Thibault Polge, "Write yourself a Git!" (wyag). Python, ~1,000 lines. Covers:

Init repository.
Object storage (blob, tree, commit, tag).
Reading and writing references.
add, commit, log.
Checkout.
Status, diff.

Single document, very readable. Plan for 8–15 hours.

For a more guided, step-by-step version: ugit (also Python). 11 numbered steps with git tags so you can check your work after each.

For 500 lines and a self-hosting punchline: Ben Hoyt's pygit, where the final commit is the tool committing itself to GitHub.

7. Implementation milestones (following wyag-style structure)

Milestone 1: `init`

Create .mygit/ with objects/, refs/heads/, refs/tags/, HEAD (containing ref: refs/heads/main), config.

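import os

def cmd_init(path):
    """Create an empty repository skeleton under <path>/.mygit."""
    git_dir = os.path.join(path, ".mygit")
    for sub in ("objects", "refs/heads", "refs/tags"):
        os.makedirs(os.path.join(git_dir, sub), exist_ok=True)
    # HEAD is a symbolic ref: it names the branch that the next commit will advance.
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    # Minimal config file, using keys that real Git also writes on init.
    with open(os.path.join(git_dir, "config"), "w") as f:
        f.write("[core]\n\trepositoryformatversion = 0\n\tbare = false\n")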

Evidence: mygit init, then tree .mygit matches expected structure.

Milestone 2: Object storage (blob)

Compute SHA-1 of "blob <length>\0<contents>". Compress with zlib. Store at .mygit/objects/<first-2-chars>/<remaining-38-chars>.

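import hashlib
import os
import zlib

def hash_object(data, obj_type="blob", write=True):
    """Return the SHA-1 of `data` stored as `obj_type`; optionally write the object to disk."""
    header = f"{obj_type} {len(data)}\0".encode()  # "<type> <length>" followed by a NUL byte
    full = header + data
    sha = hashlib.sha1(full).hexdigest()
    if write:
        path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(full))
    return sha

def cat_file(sha):
    """Read an object back from disk and return (type, contents)."""
    path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
    with open(path, "rb") as f:
        full = zlib.decompress(f.read())
    null_idx = full.index(b"\x00")
    obj_type, length = full[:null_idx].decode().split(" ")
    data = full[null_idx + 1:]
    # The header records the content length; a mismatch means the object is damaged.
    if int(length) != len(data):
        raise ValueError(f"object {sha}: header says {length} bytes, found {len(data)}")
    return obj_type, data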

Evidence: Hash a file; verify the SHA matches what git hash-object produces.
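
One way to check the header-plus-contents rule before touching the disk is to hash a few well-known inputs in memory. The sketch below is illustrative only: git_object_id is a hypothetical helper name, and the expected IDs are the widely published hashes of the empty blob, the empty tree, and the 12-byte string "hello world\n".

```python
import hashlib

def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 that Git assigns to `data` stored as `obj_type`."""
    # Git hashes "<type> <length>\0<data>", not the bare data.
    header = f"{obj_type} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# Well-known object IDs that any correct implementation reproduces.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
HELLO_WORLD = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob "hello world\n"
```

If git_object_id(b"hello world\n") disagrees with HELLO_WORLD, the usual culprits are a missing NUL byte or a length computed on text rather than bytes.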

Milestone 3: Tree objects

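A tree is a list of <mode> <name>\0<20-byte-sha> entries sorted by name, where a directory's name is compared as if it ended in "/". Recursively walk a directory; each subdirectory becomes a sub-tree.

def write_tree(path):
    """Store the directory `path` as a tree object and return its SHA."""
    def sort_key(name):
        # Git sorts by raw bytes and compares directory names as if they ended in "/".
        is_dir = os.path.isdir(os.path.join(path, name))
        return os.fsencode(name) + (b"/" if is_dir else b"")

    entries = []
    for name in sorted(os.listdir(path), key=sort_key):
        if name in (".git", ".mygit"):
            continue  # skip repository metadata; other dotfiles are ordinary files
        full = os.path.join(path, name)
        if os.path.isdir(full):
            sha = write_tree(full)
            mode = "40000"
        else:
            with open(full, "rb") as f:
                data = f.read()
            sha = hash_object(data, "blob")
            mode = "100644"  # simplification: executable bit and symlinks are ignored
        entries.append(f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha))
    tree_data = b"".join(entries)
    return hash_object(tree_data, "tree")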

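Evidence: For a directory with no empty subdirectories, executable files, or ignored files, the tree SHA matches what git add -A followed by git write-tree prints in a real repository with the same contents. (Real Git builds the tree from its index, so the files must be added first.)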
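
Entry order is the easiest part of the tree format to get subtly wrong. The following in-memory sketch builds the raw tree bytes from a list of entries; build_tree and tree_sort_key are hypothetical names, and the sketch assumes entries are given as (mode, name, hex SHA) tuples.

```python
import hashlib

def tree_sort_key(name: bytes, is_dir: bool) -> bytes:
    # Git compares a directory name as if it ended in "/".
    return name + b"/" if is_dir else name

def build_tree(entries):
    """entries: iterable of (mode, name, sha_hex). Returns (tree_sha, raw_bytes)."""
    ordered = sorted(
        entries,
        key=lambda e: tree_sort_key(e[1].encode(), e[0] == "40000"),
    )
    raw = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in ordered
    )
    header = f"tree {len(raw)}".encode() + b"\x00"
    return hashlib.sha1(header + raw).hexdigest(), raw
```

With this key, a file named foo.txt sorts before a directory named foo, because "." (0x2e) is smaller than "/" (0x2f). A plain sorted() on the names would put them the other way round and produce a different tree SHA.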

Milestone 4: Commit objects

A commit is text: tree <sha>\nparent <sha>\nauthor ...\ncommitter ...\n\n<message>.

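import time

def commit(message):
    """Snapshot the working directory, record a commit, and advance the main branch."""
    tree_sha = write_tree(".")
    head_sha = get_head_commit()  # None for the very first commit
    timestamp = int(time.time())  # read the clock once so both timestamps agree
    body = f"tree {tree_sha}\n"
    if head_sha:
        body += f"parent {head_sha}\n"
    body += f"author Me <me@example.com> {timestamp} +0000\n"
    body += f"committer Me <me@example.com> {timestamp} +0000\n\n"
    body += message + "\n"
    sha = hash_object(body.encode(), "commit")
    # update_ref is a helper that writes the SHA into .mygit/refs/heads/main.
    update_ref("refs/heads/main", sha)
    return sha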

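Evidence: mygit commit "first", then real Git pointed at the same repository shows the commit in git log (real Git looks for .git by default, so either name the directory .git or set GIT_DIR=.mygit). This demonstrates that the data model interoperates with real Git.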
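
Reading a commit back is the mirror image of writing one: a block of "key value" header lines, a blank line, then the message. The parser below is a minimal sketch; parse_commit is a hypothetical name, and multi-line headers such as gpgsig are skipped rather than reassembled.

```python
def parse_commit(data: bytes):
    """Split raw commit contents (without the object header) into fields and message."""
    head, _, message = data.partition(b"\n\n")
    fields = {"parents": []}
    for line in head.decode().split("\n"):
        if line.startswith(" "):
            continue  # continuation of a multi-line header; ignored in this sketch
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)  # merge commits carry several parents
        else:
            fields[key] = value
    return fields, message.decode()
```

Keeping parents as a list means the same function handles root commits (no parent), ordinary commits (one), and merges (two or more).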

Milestone 5: Refs and HEAD

Resolve HEAD (which may be a symbolic ref) to a commit SHA. Update refs on commit.

def get_head_commit():
    with open(".mygit/HEAD") as f:
        content = f.read().strip()
    if content.startswith("ref: "):
        ref_path = content[5:]
        try:
            with open(f".mygit/{ref_path}") as f: return f.read().strip()
        except FileNotFoundError:
            return None
    return content  # detached HEAD

Evidence: Branch creation works. cat .mygit/refs/heads/main matches the latest commit SHA.
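
The writing side of refs can be sketched in the same style. The names update_ref, current_branch_ref, and create_branch are illustrative, and the git_dir parameter is an assumption added here so the functions can be exercised against a temporary directory.

```python
import os

def update_ref(ref, sha, git_dir=".mygit"):
    """Point `ref` (for example "refs/heads/main") at `sha`."""
    path = os.path.join(git_dir, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(sha + "\n")

def current_branch_ref(git_dir=".mygit"):
    """Return the ref that HEAD names, or None when HEAD is detached."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        content = f.read().strip()
    return content[5:] if content.startswith("ref: ") else None

def create_branch(name, sha, git_dir=".mygit"):
    """A branch is nothing more than a ref file under refs/heads/."""
    update_ref(f"refs/heads/{name}", sha, git_dir)
```

A commit function that calls update_ref(current_branch_ref(), sha) instead of hard-coding refs/heads/main would advance whichever branch is checked out; handling a detached HEAD is left out of this sketch.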

Milestone 6: `log`

Walk parent commits.

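import re

def cmd_log(start=None):
    """Print each commit from `start` (default: HEAD) back to the root, following first parents."""
    sha = start or get_head_commit()
    while sha:
        obj_type, data = cat_file(sha)
        print(f"commit {sha}")
        print(data.decode())
        print()
        # Search only the header block so a message line starting with "parent" cannot match.
        header = data.split(b"\n\n", 1)[0]
        match = re.search(rb"^parent ([0-9a-f]{40})$", header, re.MULTILINE)
        sha = match.group(1).decode() if match else None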

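Evidence: mygit log lists the same commits, in the same order and with the same headers, as git log --pretty=raw (real Git additionally indents the message).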

Milestone 7: `add` and the index

The index (staging area) is a binary file .mygit/index listing tracked paths with their stat data + blob SHA. add updates the index. commit builds a tree from the index, not the working directory.

This is the conceptually trickiest part of Git. Take time.

Evidence: mygit add foo.txt → mygit status shows it staged. Modify it without re-adding → status shows it both staged (old) and modified (new).
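
A simplified index can be as small as a JSON file mapping each tracked path to the blob SHA recorded at add time. The sketch below assumes that format (real Git cannot read it) and uses hypothetical names throughout. It records hashes only; a full add would also write the blob into the object store.

```python
import hashlib
import json
import os

def load_index(index_path):
    """Return {path: blob_sha}; a missing index file means nothing is staged."""
    if not os.path.exists(index_path):
        return {}
    with open(index_path) as f:
        return json.load(f)

def save_index(index_path, index):
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2, sort_keys=True)

def blob_sha(data):
    """SHA-1 of `data` as a Git blob."""
    return hashlib.sha1(f"blob {len(data)}".encode() + b"\x00" + data).hexdigest()

def add(paths, index_path=".mygit/index.json"):
    """Stage each file by recording the hash of its current contents."""
    index = load_index(index_path)
    for path in paths:
        with open(path, "rb") as f:
            index[path] = blob_sha(f.read())
    save_index(index_path, index)
    return index
```

Because the index stores the hash from the moment of add, editing the file afterwards leaves the staged entry unchanged, which is exactly the "staged (old) and modified (new)" situation described above.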

Milestone 8: `status` and `diff`

Compare three snapshots: HEAD, the index, and the working directory. Two pairwise comparisons, plus a check for files that are missing from the index, produce the three sections of status output.

Changes to be committed:   (HEAD vs index)
Changes not staged:        (index vs working dir)
Untracked files:           (in working dir, not in index)
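
Once each snapshot is reduced to a mapping from path to blob SHA, the table above becomes three comprehensions. This is a sketch under that assumption; the function name and the section keys are illustrative.

```python
def status(head, index, workdir):
    """Each argument maps path -> blob SHA. Returns the three status sections."""
    # HEAD vs index: added, changed, or deleted in the staging area.
    staged = sorted(
        p for p in head.keys() | index.keys() if head.get(p) != index.get(p)
    )
    # Index vs working directory: edited or deleted since the last add.
    unstaged = sorted(p for p in index if workdir.get(p) != index[p])
    # Present on disk but never added.
    untracked = sorted(p for p in workdir if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}
```

A path can appear under both staged and unstaged when it was added and then edited again, since the two comparisons are independent.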

diff uses a longest-common-subsequence algorithm or Myers' diff. For simplicity, start with line-by-line output and add proper diff later.
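
A longest-common-subsequence diff fits in about twenty lines. The version below is a quadratic-time sketch for small files, not Myers' algorithm; lcs_diff is a hypothetical name, and the output is a list of tagged lines rather than a unified diff.

```python
def lcs_diff(old, new):
    """Return a list of (tag, line) pairs with tag " " (kept), "-" (removed) or "+" (added)."""
    n, m = len(old), len(new)
    # lcs[i][j] = length of the longest common subsequence of old[i:] and new[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    out, i, j = [], 0, 0
    while i < n and j < m:
        if old[i] == new[j]:
            out.append((" ", old[i]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            out.append(("-", old[i]))  # dropping old[i] keeps the LCS at least as long
            i += 1
        else:
            out.append(("+", new[j]))
            j += 1
    out.extend(("-", line) for line in old[i:])
    out.extend(("+", line) for line in new[j:])
    return out
```

Calling lcs_diff(a.splitlines(), b.splitlines()) on two file versions and printing tag + line for each pair gives a readable, if context-free, diff.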

Evidence: Edit a file, run status — output matches the conceptual table above.

Milestone 9: `checkout`

Given a commit SHA, recursively reconstitute its tree onto the working directory.

Evidence: mygit checkout <old-sha> — files revert.
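
Checkout is write_tree in reverse: parse each tree entry, recurse into sub-trees, and write blobs to disk. The sketch below uses hypothetical helper names and a git_dir parameter for testability. It writes files only: it does not delete files absent from the target commit, restore the executable bit, or handle symlinks.

```python
import os
import zlib

def read_object(sha, git_dir=".mygit"):
    """Return (type, contents) for the loose object `sha`."""
    with open(os.path.join(git_dir, "objects", sha[:2], sha[2:]), "rb") as f:
        full = zlib.decompress(f.read())
    header, _, body = full.partition(b"\x00")
    return header.split(b" ")[0].decode(), body

def parse_tree(data):
    """Yield (mode, name, sha_hex) for each entry in raw tree bytes."""
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        nul = data.index(b"\x00", space)
        mode = data[pos:space].decode()
        name = data[space + 1:nul].decode()
        sha = data[nul + 1:nul + 21].hex()  # 20 raw bytes, not hex text
        yield mode, name, sha
        pos = nul + 21

def checkout_tree(tree_sha, dest, git_dir=".mygit"):
    """Write the files recorded in `tree_sha` under the directory `dest`."""
    os.makedirs(dest, exist_ok=True)
    _, data = read_object(tree_sha, git_dir)
    for mode, name, sha in parse_tree(data):
        target = os.path.join(dest, name)
        if mode == "40000":
            checkout_tree(sha, target, git_dir)
        else:
            _, blob = read_object(sha, git_dir)
            with open(target, "wb") as f:
                f.write(blob)

def checkout_commit(commit_sha, dest, git_dir=".mygit"):
    """Resolve a commit to its tree (the first header line) and materialize it."""
    _, data = read_object(commit_sha, git_dir)
    tree_sha = data.split(b"\n", 1)[0].split(b" ")[1].decode()
    checkout_tree(tree_sha, dest, git_dir)
```

A complete checkout would also update HEAD and the index to match the commit it just wrote out.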

Milestone 10 (optional, ambitious): Push to GitHub

Implement enough of the smart HTTP protocol to push to GitHub. Ben Hoyt's pygit does this in ~150 lines.

This is where you confront: object packing, smart HTTP wire protocol, Git's negotiation algorithm.
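
The first wire-level detail to get right is framing. Git's smart protocols send data as pkt-lines: four hexadecimal digits giving the total length (including those four digits), followed by the payload, with 0000 acting as a flush marker. The sketch below covers only that framing; the function names are hypothetical, and other special lengths used by newer protocol versions are rejected rather than interpreted.

```python
FLUSH_PKT = b"0000"  # marks the end of a section; carries no payload

def pkt_line(payload: bytes) -> bytes:
    """Frame `payload` as one pkt-line."""
    return f"{len(payload) + 4:04x}".encode() + payload

def parse_pkt_lines(stream: bytes):
    """Split a byte string into payloads; a flush packet is reported as None."""
    pos, out = 0, []
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:
            out.append(None)
            pos += 4
        elif length < 4:
            raise ValueError(f"unsupported special packet {length:04x}")
        else:
            out.append(stream[pos + 4:pos + length])
            pos += length
    return out
```

For example, the six-byte payload b"hello\n" is framed as b"000ahello\n", since 6 + 4 = 10 = 0x000a.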

8. Tests & evidence

Test	How
Object hashing	SHA of a known string matches `git hash-object`
Tree byte format	Tree SHA matches real Git for the same directory
Commit interoperability	Real Git can read your commits and vice versa (this is the strongest evidence)
`status` correctness	Three-way comparison produces matching output for many edit scenarios
Round trip	`add → commit → checkout → diff` — the diff against the new working dir should be empty
History	A 10-commit history with branches walks correctly

The strongest single evidence: your mygit and real git can read each other's repositories.

9. Common pitfalls

Different newline / whitespace in commit objects. Git is strict. A single missing newline changes the SHA.
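Wrong encoding or ordering of tree entries. Entries are packed binary, not text: the SHA is 20 raw bytes, not 40 hex characters. Entries must also be sorted the way Git sorts them, with a directory's name compared as if it ended in "/".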
Hashing the wrong thing. Git hashes <type> <len>\0<data>, not just <data>.
Forgetting to compress. Objects are zlib-compressed on disk.
Index format confusion. Git's index has a binary format with a strict version. For a tutorial, define your own simpler index. State the incompatibility.
Symlinks and executable bit. Git distinguishes file modes 100644 (normal), 100755 (executable), and 120000 (symlink). On Windows the executable bit is meaningless. State the simplification.
Trying to implement merge before basic operations work. Merge is the hard part. Get init/add/commit/log/checkout solid first.

10. Extensions

Branches — already mostly there. Add branch <name> and checkout <branch>.
Merge — three-way merge. Find common ancestor; compute three-way diff; write result.
Diff — Myers' algorithm. Not too bad.
Tags — annotated tags are a new object type; lightweight tags are just refs.
Remotes — smart HTTP protocol. Hard but the canonical way to learn Git's internals deeply.
Pack files — Git's optimization for storing many small objects efficiently. git gc rewrites a repository into pack files.
Garbage collection — reachability analysis from refs.
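
Of these extensions, garbage collection is the quickest to prototype, because reachability is a plain graph traversal followed by a set difference. The sketch below works on an in-memory graph; the function names and the children mapping are assumptions, and a real implementation would build that mapping by parsing commits and trees from the object store.

```python
def reachable(roots, children):
    """Return every object ID reachable from `roots`.

    `children` maps an object ID to the IDs it references:
    a commit to its tree and parents, a tree to its entries.
    """
    seen, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(children.get(oid, ()))
    return seen

def garbage(all_objects, roots, children):
    """Objects that no ref can reach, and that a collector could delete."""
    return set(all_objects) - reachable(roots, children)
```

Here roots would be the SHAs stored in the files under refs/ plus HEAD. An object shared by several commits is visited once, thanks to the seen set.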

11. Module integration

Module	What Git deepens
Sem 4 Module 1 — C/Python fundamentals	Solid project-sized program with multiple concerns.
Sem 5 Module 4 — File systems & I/O	Git is a small filesystem on top of a filesystem.
Blockchain tutorial	Both are content-addressable, hash-chained, append-only structures.
Database (KV) tutorial	Git's object store is a content-addressable KV.
Sem 7 architecture / DDD	Three-way comparison (HEAD/index/working) is a small but rich domain model.

12. Portfolio framing

What to publish:

Source organized as mygit/{init,object,tree,commit,index,refs,...}.py.
A README that demonstrates the interoperability test: your mygit commit produces an object that real git log reads.
Tests that hash known inputs and verify against git hash-object.

Reviewer entry points:

mygit/object.py — hashing and storage.
mygit/index.py — staging area (the most subtle code).
mygit/commands/commit.py — the orchestration.
README must include: list of supported commands, list of unsupported commands, the interoperability demo.

This is a genuinely impressive portfolio project. "I wrote enough of Git to commit itself to GitHub" reads well anywhere.

Source

This tutorial draws from the BYO-X catalog "Git" section. Pro Git, Chapter 10 is the canonical Git internals reference. wyag, ugit, and pygit are the three best modern Python walkthroughs.
