
Build Your Own Git

"Git is a content-addressable filesystem with a version-control system grafted on top." — Linus Torvalds

Building a minimal Git clone is the single best way to understand the tool you already use every day. The data model is small (four object types), the algorithms are simple (SHA-1 hashing, zlib compression, set difference), and the result is a working clone of git init, git add, git commit, git log, and git status in under 500 lines.


1. Overview & motivation

Git's data model has exactly four object types:

  • Blob — file contents.
  • Tree — directory listing (names + types + hash references).
  • Commit — parent commit hash + tree hash + author + message.
  • Tag — annotated reference to another object.

Every object is identified by the SHA-1 (or SHA-256 in newer Git) of its contents, prefixed with a short type-and-length header. Objects are stored on disk by their hash. Branches and tags are just files in .git/refs/ that contain hashes.

That's it. The rest of Git — staging, merging, rebasing, remotes — is operations on these four object types.
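To make the content-addressing idea concrete, here is a minimal in-memory sketch. ObjectStore, put, and get are hypothetical names chosen for illustration, and the sketch hashes the raw bytes directly, whereas real Git also hashes a type-and-length header in front of the data.

```python
import hashlib


class ObjectStore:
    """Toy in-memory content-addressable store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # hex SHA-1 -> raw bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing the same content twice is a no-op: the key IS the content hash.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ObjectStore()
a = store.put(b"hello\n")
b = store.put(b"hello\n")    # same content, same key, stored once
c = store.put(b"hello!\n")   # different content, different key

branches = {"main": a}       # a "branch" is just a name pointing at a hash
branches["main"] = c         # moving the branch leaves both objects untouched
```

Identical content always lands under the same key, so deduplication comes for free, and moving a branch never rewrites an object.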

What you can only learn by building one:

  • Why content-addressable storage is the right primitive for version control.
  • Why git status is much more complex than it looks (compare working directory, index, HEAD).
  • Why git rebase is just "recreate commits onto a different parent."
  • Why packfiles exist (object-per-file is slow; pack them up).
  • Why Git is immutable — branches move; objects never change.

2. Where this fits in the degree

  • Phase: Systems
  • Semester: 4 or 5 (Systems Programming / OS-Networking)
  • Modules deepened: Sem 4 Module 1 (C/Python/Ruby fundamentals), Sem 5 Module 4 (file systems & I/O — Git is a small filesystem on top of a filesystem).

Cross-phase relevance:

  • Connects to the Blockchain tutorial — both are content-addressable hash-chained structures.
  • Familiar territory for the Database (KV) tutorial — Git's object database is essentially a hash-keyed KV store on disk.

3. Prerequisites

  • A scripting language: Python or Ruby. (Or Haskell/Rust if you prefer the harder paths.)
  • SHA-1: just hashlib.sha1(...). No cryptography knowledge needed.
  • zlib: zlib.compress / zlib.decompress. Just APIs.
  • Some familiarity with using Git from the command line.

4. Theory & research

Required reading

For depth


5. Curated tutorial list (from BYO-X)

  • Haskell: Reimplementing "git clone" in Haskell from the bottom up
  • JavaScript: Gitlet (MaryRoseCook/gitlet): 300 lines, includes diff and merge
  • JavaScript: Build GIT - Learn GIT
  • Python: Just enough of a Git client to create a repo, commit, and push itself to GitHub (Ben Hoyt's pygit): 500 lines, self-hosting
  • Python: Write yourself a Git! (Thibault Polge, wyag): recommended primary
  • Python: ugit: Learn Git Internals by Building Git Yourself (Nikita Leshenko, ugit.readthedocs.io): 11 incremental steps with tags
  • Ruby: Rebuilding Git in Ruby

Thibault Polge, "Write yourself a Git!" (wyag). Python, ~1,000 lines. Covers:

  1. Init repository.
  2. Object storage (blob, tree, commit, tag).
  3. Reading and writing references.
  4. add, commit, log.
  5. Checkout.
  6. Status, diff.

Single document, very readable. Plan for 8–15 hours.

For a more guided, step-by-step version: ugit (also Python). 11 numbered steps with git tags so you can check your work after each.

For 500 lines and a self-hosting punchline: Ben Hoyt's pygit, where the final commit is the tool committing itself to GitHub.


7. Implementation milestones (following wyag-style structure)

Milestone 1: init

Create .mygit/ with objects/, refs/heads/, refs/tags/, HEAD (containing ref: refs/heads/main), config.

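import os

def cmd_init(path):
    os.makedirs(f"{path}/.mygit/objects", exist_ok=True)
    os.makedirs(f"{path}/.mygit/refs/heads", exist_ok=True)
    os.makedirs(f"{path}/.mygit/refs/tags", exist_ok=True)
    # HEAD starts as a symbolic ref to a branch that has no commits yet.
    with open(f"{path}/.mygit/HEAD", "w") as f:
        f.write("ref: refs/heads/main\n")
    # Empty config placeholder, so the layout matches the description above.
    open(f"{path}/.mygit/config", "a").close()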

Evidence: mygit init, then tree .mygit matches expected structure.

Milestone 2: Object storage (blob)

Compute SHA-1 of "blob <length>\0<contents>". Compress with zlib. Store at .mygit/objects/<first-2-chars>/<remaining-38-chars>.

import hashlib
import os
import zlib

def hash_object(data, obj_type="blob", write=True):
    # Git hashes the header plus the payload, not the payload alone.
    header = f"{obj_type} {len(data)}\0".encode()
    full = header + data
    sha = hashlib.sha1(full).hexdigest()
    if write:
        path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(full))
    return sha

def cat_file(sha):
    path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
    with open(path, "rb") as f:
        full = zlib.decompress(f.read())
    null_idx = full.index(b"\x00")
    header = full[:null_idx]
    obj_type, length = header.decode().split(" ")
    body = full[null_idx + 1:]
    # The declared length must match the payload, or the object is corrupt.
    if int(length) != len(body):
        raise ValueError(f"object {sha}: bad length")
    return obj_type, body

Evidence: Hash a file; verify the SHA matches what git hash-object produces.
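As a quick self-check that needs no repository at all, the sketch below compares the blob hash against two widely quoted reference values. git_blob_sha is a hypothetical helper name; it repeats only the hashing step of hash_object and writes nothing to disk.

```python
import hashlib


def git_blob_sha(data: bytes) -> str:
    """SHA-1 of a blob as Git computes it: header, NUL byte, then contents."""
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


# Reference values, reproducible with real Git:
#   printf '' | git hash-object --stdin
#   echo 'hello world' | git hash-object --stdin
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
HELLO_WORLD = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"

assert git_blob_sha(b"") == EMPTY_BLOB
assert git_blob_sha(b"hello world\n") == HELLO_WORLD  # echo appends the newline
```

If either assertion fails, the usual culprits are a missing NUL byte in the header or a length counted in characters rather than bytes.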

Milestone 3: Tree objects

A tree is a sorted list of <mode> <name>\0<20-byte-sha> entries. Recursively walk a directory; each subdirectory becomes a sub-tree.

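import os

def write_tree(path):
    def sort_key(name):
        # Git orders entries as if directory names ended in "/".
        return name + "/" if os.path.isdir(os.path.join(path, name)) else name

    entries = []
    for name in sorted(os.listdir(path), key=sort_key):
        # Skip repository metadata only; ordinary dotfiles are tracked content.
        if name in (".git", ".mygit"):
            continue
        full = os.path.join(path, name)
        if os.path.isdir(full):
            sha = write_tree(full)
            mode = "40000"
        else:
            with open(full, "rb") as f:
                data = f.read()
            sha = hash_object(data, "blob")
            mode = "100644"
        # Entry layout: ASCII mode, space, name, NUL, then 20 raw SHA bytes.
        entries.append(f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha))
    tree_data = b"".join(entries)
    return hash_object(tree_data, "tree")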

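Evidence: With every file staged in a real repository (git add -A), the tree SHA matches the output of git write-tree, provided the directory holds only plain, non-executable files (no symlinks, no empty subdirectories).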
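Reading a tree back is the mirror image of writing one, and later milestones such as checkout need it. The following is a minimal sketch of a parser for the entry layout described above; parse_tree is a hypothetical helper name, and it assumes SHA-1 (20-byte) object IDs.

```python
def parse_tree(data: bytes):
    """Split raw tree payload into (mode, name, hex_sha) tuples."""
    entries = []
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)        # end of the ASCII mode
        null = data.index(b"\x00", space)    # end of the entry name
        mode = data[pos:space].decode()
        name = data[space + 1:null].decode()
        sha = data[null + 1:null + 21].hex()  # 20 raw bytes -> 40 hex chars
        entries.append((mode, name, sha))
        pos = null + 21
    return entries


# Hand-built payload with one file entry and one directory entry.
raw = (
    b"100644 a.txt\x00" + bytes.fromhex("3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
    + b"40000 src\x00" + bytes(20)
)
entries = parse_tree(raw)
```

The parser never searches for a newline because tree entries have none; the only delimiters are the space after the mode and the NUL after the name.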

Milestone 4: Commit objects

A commit is text: tree <sha>\nparent <sha>\nauthor ...\ncommitter ...\n\n<message>.

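import time

def commit(message):
    tree_sha = write_tree(".")
    head_sha = get_head_commit()
    # Take one timestamp so the author and committer lines cannot disagree.
    now = int(time.time())
    body = f"tree {tree_sha}\n"
    if head_sha:
        # The first commit in a repository has no parent line.
        body += f"parent {head_sha}\n"
    body += f"author Me <me@example.com> {now} +0000\n"
    body += f"committer Me <me@example.com> {now} +0000\n\n"
    body += message + "\n"
    sha = hash_object(body.encode(), "commit")
    # update_ref is a helper that writes sha into .mygit/refs/heads/main.
    # Hard-coding main assumes HEAD still points at that branch.
    update_ref("refs/heads/main", sha)
    return sha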

Evidence: mygit commit "first", then git --git-dir=.mygit log (real Git pointed at the .mygit directory) shows the commit. This demonstrates that the data model interoperates with real Git.
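Going the other way, a commit payload can be split back into its header fields and message. This is a simplified sketch: parse_commit is a hypothetical name, and it assumes single-line headers, so multi-line headers such as signatures are not handled properly.

```python
def parse_commit(data: bytes):
    """Return (fields, message) for a raw commit payload."""
    head, _, message = data.partition(b"\n\n")  # blank line ends the headers
    fields = {"parent": []}                     # zero, one, or several parents
    for line in head.decode().split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    return fields, message.decode()


raw_commit = (
    b"tree " + b"a" * 40 + b"\n"
    b"parent " + b"b" * 40 + b"\n"
    b"author Me <me@example.com> 0 +0000\n"
    b"committer Me <me@example.com> 0 +0000\n"
    b"\n"
    b"first\n"
)
fields, message = parse_commit(raw_commit)
```

Keeping parent as a list from the start means merge commits need no special case later.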

Milestone 5: Refs and HEAD

Resolve HEAD (which may be a symbolic ref) to a commit SHA. Update refs on commit.

def get_head_commit():
    with open(".mygit/HEAD") as f:
        content = f.read().strip()
    if content.startswith("ref: "):
        ref_path = content[5:]
        try:
            with open(f".mygit/{ref_path}") as f:
                return f.read().strip()
        except FileNotFoundError:
            return None
    return content  # detached HEAD

Evidence: Branch creation works. cat .mygit/refs/heads/main matches the latest commit SHA.
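The write side of refs can be sketched in a few lines. The version below is one possible shape, not the only one: the git_dir parameter and the resolve_ref helper are assumptions added so the sketch can run against any directory, and the write-then-rename step is a common way to avoid leaving a half-written ref behind.

```python
import os


def update_ref(ref: str, sha: str, git_dir: str = ".mygit") -> None:
    """Point ref (e.g. 'refs/heads/main') at sha."""
    path = os.path.join(git_dir, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".lock"
    with open(tmp, "w") as f:
        f.write(sha + "\n")
    os.replace(tmp, path)  # rename over the old file in one step


def resolve_ref(ref: str, git_dir: str = ".mygit"):
    """Follow 'ref: ...' indirection until a SHA is found (None if unborn)."""
    try:
        with open(os.path.join(git_dir, ref)) as f:
            content = f.read().strip()
    except FileNotFoundError:
        return None
    if content.startswith("ref: "):
        return resolve_ref(content[5:], git_dir)
    return content
```

With these two helpers, creating a branch amounts to update_ref("refs/heads/<name>", resolve_ref("HEAD")).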

Milestone 6: log

Walk parent commits.

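import re

def cmd_log(start=None):
    sha = start or get_head_commit()
    while sha:
        obj_type, data = cat_file(sha)
        print(f"commit {sha}")
        print(data.decode())
        print()
        # Search only the header block, so a message line that happens to
        # start with "parent " is never mistaken for a parent pointer.
        headers = data.split(b"\n\n", 1)[0]
        # Follows the first parent only; merge commits list several.
        match = re.search(rb"^parent ([0-9a-f]{40})$", headers, re.MULTILINE)
        sha = match.group(1).decode() if match else None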

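Evidence: mygit log shows the same commits, in the same order, as git --git-dir=.mygit log --pretty=raw (real Git additionally indents the message by four spaces).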

Milestone 7: add and the index

The index (staging area) is a binary file .mygit/index listing tracked paths with their stat data + blob SHA. add updates the index. commit builds a tree from the index, not the working directory.

This is the conceptually trickiest part of Git. Take time.

Evidence: mygit add foo.txt → mygit status shows it staged. Modify it without re-adding → status shows it both staged (old) and modified (new).
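One way to get started without Git's binary index format is a JSON file that maps each path to its blob SHA plus a little stat data. The sketch below is exactly that simplification: the JSON layout, the field names, and the helpers read_index, write_index, add, and is_modified are all assumptions, and real Git cannot read this file. Writing the blob itself to the object store (hash_object from Milestone 2) is left out to keep the sketch self-contained.

```python
import hashlib
import json
import os

INDEX_PATH = ".mygit/index"  # simplified JSON index, NOT Git's binary format


def read_index(index_path=INDEX_PATH):
    try:
        with open(index_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # nothing staged yet


def write_index(index, index_path=INDEX_PATH):
    os.makedirs(os.path.dirname(index_path) or ".", exist_ok=True)
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2, sort_keys=True)


def blob_sha(data: bytes) -> str:
    return hashlib.sha1(f"blob {len(data)}\0".encode() + data).hexdigest()


def add(path, index_path=INDEX_PATH):
    """Record the file's current content hash and stat data in the index."""
    with open(path, "rb") as f:
        data = f.read()
    st = os.stat(path)
    index = read_index(index_path)
    index[path.replace(os.sep, "/")] = {
        "sha": blob_sha(data),
        "size": st.st_size,
        "mtime_ns": st.st_mtime_ns,
    }
    write_index(index, index_path)
    return index


def is_modified(path, entry) -> bool:
    """Cheap stat comparison first; hash the file only if the stat data changed."""
    st = os.stat(path)
    if st.st_size == entry["size"] and st.st_mtime_ns == entry["mtime_ns"]:
        return False
    with open(path, "rb") as f:
        return blob_sha(f.read()) != entry["sha"]
```

The stat fields exist purely as a shortcut: they let status skip re-hashing files that have obviously not changed, which is the same reason the real index stores them.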

Milestone 8: status and diff

Compare three things: HEAD, index, working directory. Each comparison produces one section of status output.

Changes to be committed:   (HEAD vs index)
Changes not staged: (index vs working dir)
Untracked files: (in working dir, not in index)
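The three sections above reduce to dictionary and set comparisons once each side is flattened into a path-to-SHA mapping. The sketch below assumes that flattening has already been done; the function name status and the three plain-dict arguments are illustrative choices.

```python
def status(head, index, workdir):
    """Each argument maps path -> blob SHA. Returns the three status sections."""
    staged = {}  # HEAD vs index
    for path in sorted(set(head) | set(index)):
        if path not in head:
            staged[path] = "new file"
        elif path not in index:
            staged[path] = "deleted"
        elif head[path] != index[path]:
            staged[path] = "modified"

    unstaged = {}  # index vs working directory
    for path in sorted(index):
        if path not in workdir:
            unstaged[path] = "deleted"
        elif workdir[path] != index[path]:
            unstaged[path] = "modified"

    untracked = sorted(set(workdir) - set(index))  # in working dir, not in index
    return staged, unstaged, untracked


head = {"a.txt": "1", "b.txt": "2"}
index = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}
workdir = {"a.txt": "9", "b.txt": "3", "d.txt": "5"}
staged, unstaged, untracked = status(head, index, workdir)
```

A file can appear in both the staged and unstaged sections at once, which is exactly the "staged (old) and modified (new)" case from Milestone 7.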

diff uses a longest-common-subsequence algorithm or Myers' diff. For simplicity, start with line-by-line output and add proper diff later.
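For the longest-common-subsequence route, a small dynamic-programming sketch is enough to produce usable line diffs. This is the textbook quadratic LCS table, not Myers' algorithm, and lcs_diff is a hypothetical name; it is fine for small files and slow for large ones.

```python
def lcs_diff(a, b):
    """Line diff of two lists of strings using an LCS table."""
    n, m = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the top-left, emitting kept, removed, and added lines.
    out = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append("  " + a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            out.append("- " + a[i])
            i += 1
        else:
            out.append("+ " + b[j])
            j += 1
    out.extend("- " + line for line in a[i:])  # leftovers only in the old file
    out.extend("+ " + line for line in b[j:])  # leftovers only in the new file
    return out


result = lcs_diff(["a", "b", "c"], ["a", "c", "d"])
```

For the example inputs, result keeps a and c, marks b as removed, and marks d as added.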

Evidence: Edit a file, run status — output matches the conceptual table above.

Milestone 9: checkout

Given a commit SHA, recursively reconstitute its tree onto the working directory.

Evidence: mygit checkout <old-sha> — files revert.
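The recursive reconstitution can be sketched as follows. To stay self-contained, the sketch takes a read_object callable (assumed to behave like cat_file from Milestone 2, returning a type and a payload) and starts from a tree SHA, so a caller would first read the commit and extract its tree line. It deliberately omits several things a real checkout needs: it does not delete files missing from the target tree, does not protect uncommitted changes, and writes symlink entries as ordinary files.

```python
import os


def checkout_tree(read_object, tree_sha, dest):
    """Write the tree identified by tree_sha into directory dest."""
    obj_type, data = read_object(tree_sha)
    if obj_type != "tree":
        raise ValueError(f"{tree_sha} is a {obj_type}, not a tree")
    os.makedirs(dest, exist_ok=True)
    pos = 0
    while pos < len(data):
        # Entry layout: ASCII mode, space, name, NUL, 20 raw SHA bytes.
        space = data.index(b" ", pos)
        null = data.index(b"\x00", space)
        mode = data[pos:space].decode()
        name = data[space + 1:null].decode()
        sha = data[null + 1:null + 21].hex()
        pos = null + 21

        target = os.path.join(dest, name)
        if mode == "40000":
            checkout_tree(read_object, sha, target)  # recurse into the sub-tree
        else:
            _, blob = read_object(sha)
            with open(target, "wb") as f:
                f.write(blob)
            if mode == "100755":
                os.chmod(target, 0o755)  # restore the executable bit
```

Passing the object reader in as a parameter also makes the function easy to test against an in-memory dictionary before pointing it at a real object store.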

Milestone 10 (optional, ambitious): Push to GitHub

Implement enough of the smart HTTP protocol to push to GitHub. Ben Hoyt's pygit does this in ~150 lines.

This is where you confront: object packing, smart HTTP wire protocol, Git's negotiation algorithm.
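The smallest building block of the wire protocol is the pkt-line framing: four hex digits giving the total length (including those four digits), followed by the payload, with 0000 acting as a flush marker. The sketch below covers only that framing; the function names are hypothetical, and negotiation, capabilities, and packfile generation are all out of scope.

```python
FLUSH_PKT = b"0000"  # zero-length marker that ends a section of the stream


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length prefix counts itself
    if length > 65520:
        raise ValueError("payload too large for a single pkt-line")
    return f"{length:04x}".encode() + payload


def parse_pkt_lines(data: bytes):
    """Split a byte stream into payloads; control packets become None."""
    lines = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length < 4:
            # 0000 is the flush packet; other small values are control
            # markers in newer protocol versions, lumped together here.
            lines.append(None)
            pos += 4
        else:
            lines.append(data[pos + 4:pos + length])
            pos += length
    return lines


stream = pkt_line(b"a\n") + pkt_line(b"foobar\n") + FLUSH_PKT
```

Parsing stream yields the two payloads followed by None for the flush packet.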


8. Tests & evidence

Test                      How
Object hashing            SHA of a known string matches git hash-object
Tree byte format          Tree SHA matches real Git for the same directory
Commit interoperability   Real Git can read your commits and vice versa (this is the strongest evidence)
status correctness        Three-way comparison produces matching output for many edit scenarios
Round trip                add → commit → checkout → diff: the diff against the new working dir should be empty
History                   A 10-commit history with branches walks correctly

The strongest single evidence: your mygit and real git can read each other's repositories.


9. Common pitfalls

  • Different newline / whitespace in commit objects. Git is strict. A single missing newline changes the SHA.
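  • Wrong encoding or ordering of tree entries. Entries are packed binary, not text lines: the SHA is 20 raw bytes, not 40 hex characters, and directory modes are written as 40000 with no leading zero. Entries are sorted by name, with directories compared as if their names ended in /.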
  • Hashing the wrong thing. Git hashes <type> <len>\0<data>, not just <data>.
  • Forgetting to compress. Objects are zlib-compressed on disk.
  • Index format confusion. Git's index has a binary format with a strict version. For a tutorial, define your own simpler index. State the incompatibility.
  • Symlinks and executable bit. Git distinguishes file modes 100644 (normal), 100755 (executable), and 120000 (symlink). On Windows the executable bit is meaningless. State the simplification.
  • Trying to implement merge before basic operations work. Merge is the hard part. Get init/add/commit/log/checkout solid first.

10. Extensions

  • Branches — already mostly there. Add branch <name> and checkout <branch>.
  • Merge — three-way merge. Find common ancestor; compute three-way diff; write result.
  • Diff — Myers' algorithm. Not too bad.
  • Tags — annotated tags are a new object type; lightweight tags are just refs.
  • Remotes — smart HTTP protocol. Hard but the canonical way to learn Git's internals deeply.
  • Pack files — Git's optimization for storing many small objects efficiently. git gc rewrites a repository into pack files.
  • Garbage collection — reachability analysis from refs.
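As a starting point for the garbage-collection extension, reachability analysis is a plain graph traversal from the refs. The sketch below assumes a hypothetical references(sha) callable that returns the SHAs an object points at (a commit's tree and parents, a tree's entries, nothing for a blob); anything not visited is a candidate for deletion.

```python
def reachable(roots, references):
    """Return the set of object SHAs reachable from the given root SHAs."""
    seen = set()
    stack = list(roots)
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue  # shared subtrees and blobs are visited once
        seen.add(sha)
        stack.extend(references(sha))
    return seen


# Tiny hand-made object graph: two commits, two trees, two blobs, one orphan.
graph = {
    "c2": ["c1", "t2"],
    "c1": ["t1"],
    "t1": ["b1"],
    "t2": ["b1", "b2"],
    "b1": [],
    "b2": [],
    "orphan": [],
}
live = reachable(["c2"], graph.__getitem__)
garbage = set(graph) - live
```

An explicit stack is used instead of recursion so that long histories do not hit Python's recursion limit.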

11. Module integration

Module                                    What Git deepens
Sem 4 Module 1: C/Python fundamentals     Solid project-sized program with multiple concerns.
Sem 5 Module 4: File systems & I/O        Git is a small filesystem on top of a filesystem.
Blockchain tutorial                       Both are content-addressable, hash-chained, append-only structures.
Database (KV) tutorial                    Git's object store is a content-addressable KV.
Sem 7 architecture / DDD                  Three-way comparison (HEAD/index/working) is a small but rich domain model.

12. Portfolio framing

What to publish:

  • Source organized as mygit/{init,object,tree,commit,index,refs,...}.py.
  • A README that demonstrates the interoperability test: your mygit commit produces an object that real git log reads.
  • Tests that hash known inputs and verify against git hash-object.

Reviewer entry points:

  • mygit/object.py — hashing and storage.
  • mygit/index.py — staging area (the most subtle code).
  • mygit/commands/commit.py — the orchestration.
  • README must include: list of supported commands, list of unsupported commands, the interoperability demo.

This is a genuinely impressive portfolio project. "I wrote enough of Git to commit itself to GitHub" reads well anywhere.


Source

This tutorial draws from the BYO-X catalog "Git" section. Pro Git, Chapter 10 is the canonical Git internals reference. wyag, ugit, and pygit are the three best modern Python walkthroughs.