Build Your Own Git
"Git is a content-addressable filesystem with a version-control system grafted on top." — Linus Torvalds
Building a minimal Git clone is the single best way to understand the tool you already use every day. The data model is small (four object types), the algorithms are simple (SHA-1 hashing, zlib compression, set difference), and the result is a working clone of git init, git add, git commit, git log, and git status in under 500 lines.
1. Overview & motivation
Git's data model has exactly four object types:
- Blob — file contents.
- Tree — directory listing (names + types + hash references).
- Commit — parent commit hash + tree hash + author + message.
- Tag — annotated reference to another object.
Every object is identified by the SHA-1 (or SHA-256 in newer Git) of its contents. Objects are stored on disk by their hash. Branches and tags are just files in .git/refs/ that contain hashes.
That's it. The rest of Git — staging, merging, rebasing, remotes — is operations on these four object types.
What you can only learn by building one:
- Why content-addressable storage is the right primitive for version control.
- Why git status is much more complex than it looks (compare working directory, index, HEAD).
- Why git rebase is just "recreate commits onto a different parent."
- Why packfiles exist (object-per-file is slow; pack them up).
- Why Git's object store is immutable: branches move; objects never change.
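The last point can be made concrete with a toy in-memory store. This is only a sketch: `ObjectStore` and the `refs` dict are hypothetical names, and for brevity it hashes the raw bytes rather than Git's real `<type> <len>\0<data>` framing.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing the same bytes twice is a no-op: same content, same key.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
refs = {}  # branch name -> hash; the only mutable state in this model

first = store.put(b"commit 1, no parent")
refs["main"] = first

# A new commit names its parent by hash; the old object is left untouched.
second = store.put(b"commit 2, parent " + first.encode())
refs["main"] = second  # the branch moves, the objects do not

duplicate = store.put(b"commit 1, no parent")
```

Because a key is derived from the content, changing an object would change its name, which is why history can only be extended or re-created, never edited in place.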
2. Where this fits in the degree
- Phase: Systems
- Semester: 4 or 5 (Systems Programming / OS-Networking)
- Modules deepened: Sem 4 Module 1 (C/Python/Ruby fundamentals), Sem 5 Module 4 (file systems & I/O — Git is a small filesystem on top of a filesystem).
Cross-phase relevance:
- Connects to the Blockchain tutorial — both are content-addressable hash-chained structures.
- Familiar territory for the Database (KV) tutorial — Git's object database is essentially a hash-keyed KV store on disk.
3. Prerequisites
- A scripting language: Python or Ruby. (Or Haskell/Rust if you prefer the harder paths.)
- SHA-1: just hashlib.sha1(...). No cryptography knowledge needed.
- zlib: zlib.compress / zlib.decompress. Just APIs.
- Some familiarity with using Git from the command line.
4. Theory & research
Required reading
- Scott Chacon & Ben Straub, Pro Git, Chapter 10: "Git Internals" (git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain). Free online. The single canonical text on Git's data model. ⭐ read this first.
- Ben Hoyt, "Just enough of a Git client to create a repo, commit, and push itself to GitHub" (benhoyt.com/writings/pygit/). For a JavaScript alternative, see Mary Rose Cook's Gitlet (maryrosecook.com / github.com/MaryRoseCook/gitlet).
Strongly recommended
- Thibault Polge, "Write yourself a Git!" — wyag.thb.lt. The single most thorough Python tutorial. ⭐ recommended primary path.
- Nikita Leshenko, "Git Internals: Plumbing and Porcelain" — shahriar.svbtle.com/becoming-a-git-master. Concise.
For depth
- Git source code (github.com/git/git): cache.h, object.c, commit.c. Production C.
- Linus Torvalds' first Git commit message (April 2005). Read the comments in the very first version: github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290.
5. Curated tutorial list (from BYO-X)
- Haskell: Reimplementing "git clone" in Haskell from the bottom up
- JavaScript: Gitlet — MaryRoseCook/gitlet — 300 lines, includes diff and merge
- JavaScript: Build GIT - Learn GIT
- Python: Just enough of a Git client to create a repo, commit, and push itself to GitHub — Ben Hoyt's pygit — 500 lines, self-hosting
- Python: Write yourself a Git! — Thibault Polge, wyag ⭐ recommended primary
- Python: ugit: Learn Git Internals by Building Git Yourself — Nikita Leshenko, ugit.readthedocs.io — 11 incremental steps with tags
- Ruby: Rebuilding Git in Ruby
6. Recommended primary path
Thibault Polge, "Write yourself a Git!" (wyag). Python, ~1,000 lines. Covers:
- Init repository.
- Object storage (blob, tree, commit, tag).
- Reading and writing references.
- add, commit, log.
- Checkout.
- Status, diff.
Single document, very readable. Plan for 8–15 hours.
For a more guided, step-by-step version: ugit (also Python). 11 numbered steps with git tags so you can check your work after each.
For 500 lines and a self-hosting punchline: Ben Hoyt's pygit, where the final commit is the tool committing itself to GitHub.
7. Implementation milestones (following wyag-style structure)
Milestone 1: init
Create .mygit/ with objects/, refs/heads/, refs/tags/, HEAD (containing ref: refs/heads/main), config.
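import os

def cmd_init(path):
    """Create the .mygit directory skeleton under path."""
    git_dir = os.path.join(path, ".mygit")
    for sub in ("objects", "refs/heads", "refs/tags"):
        os.makedirs(os.path.join(git_dir, sub), exist_ok=True)
    # HEAD starts as a symbolic ref to a branch that has no commits yet.
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    # The layout above also lists config; an empty file is enough for now.
    open(os.path.join(git_dir, "config"), "a").close()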
Evidence: mygit init, then tree .mygit matches expected structure.
Milestone 2: Object storage (blob)
Compute SHA-1 of "blob <length>\0<contents>". Compress with zlib. Store at .mygit/objects/<first-2-chars>/<remaining-38-chars>.
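import hashlib
import os
import zlib

def hash_object(data, obj_type="blob", write=True):
    """Hash data (bytes) as a Git object, optionally store it, return the hex SHA-1."""
    header = f"{obj_type} {len(data)}\0".encode()
    full = header + data
    sha = hashlib.sha1(full).hexdigest()
    if write:
        path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(full))
    return sha

def cat_file(sha):
    """Read an object back from disk; returns (type, payload bytes)."""
    path = f".mygit/objects/{sha[:2]}/{sha[2:]}"
    with open(path, "rb") as f:
        full = zlib.decompress(f.read())
    null_idx = full.index(b"\x00")
    obj_type, length = full[:null_idx].decode().split(" ")
    data = full[null_idx + 1:]
    # The length in the header must match the payload, or the object is corrupt.
    assert int(length) == len(data), f"bad length in object {sha}"
    return obj_type, data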
Evidence: Hash a file; verify the SHA matches what git hash-object produces.
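The evidence step can also be automated without shelling out to Git, by comparing against well-known published hashes. A sketch, assuming the hypothetical helper name `git_sha`; the three constants are the widely documented SHA-1 values for the empty blob, the empty tree, and the Pro Git example `echo 'test content' | git hash-object --stdin`.

```python
import hashlib
import zlib


def git_object_bytes(data: bytes, obj_type: str = "blob") -> bytes:
    """Return the exact bytes Git hashes: type, space, length, NUL, data."""
    return f"{obj_type} {len(data)}\0".encode() + data


def git_sha(data: bytes, obj_type: str = "blob") -> str:
    """Hex SHA-1 of data framed as a Git object (hypothetical helper)."""
    return hashlib.sha1(git_object_bytes(data, obj_type)).hexdigest()


# Well-known reference values.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
TEST_CONTENT = "d670460b4b4aece5915caf5c68d12f560a9fe3e4"  # b"test content\n"


def self_check() -> bool:
    """True if hashing and the compress/decompress round trip both behave."""
    framed = git_object_bytes(b"test content\n")
    round_trip_ok = zlib.decompress(zlib.compress(framed)) == framed
    return (
        round_trip_ok
        and git_sha(b"") == EMPTY_BLOB
        and git_sha(b"", "tree") == EMPTY_TREE
        and git_sha(b"test content\n") == TEST_CONTENT
    )
```

Note the trailing newline in the last case: `echo` appends one, and omitting it yields a different hash.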
Milestone 3: Tree objects
A tree is a sorted list of <mode> <name>\0<20-byte-sha> entries. Recursively walk a directory; each subdirectory becomes a sub-tree.
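def write_tree(path):
    """Write a tree object for path (recursing into subdirectories); returns its SHA."""
    entries = []
    for name in os.listdir(path):
        # Simplification: skips every dotfile, which also keeps .git and .mygit out.
        if name.startswith("."):
            continue
        full = os.path.join(path, name)
        if os.path.isdir(full):
            sha, mode = write_tree(full), "40000"
        else:
            with open(full, "rb") as f:
                sha = hash_object(f.read(), "blob")
            mode = "100644"
        entries.append((name, mode, sha))
    # Git orders entries by name bytes, comparing directories as if they ended in "/".
    # A plain sorted(os.listdir()) puts directory "foo" before "foo.txt"; Git does not.
    entries.sort(key=lambda e: (e[0] + "/" if e[1] == "40000" else e[0]).encode())
    tree_data = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
        for name, mode, sha in entries
    )
    return hash_object(tree_data, "tree")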
Evidence: Tree of a directory matches git write-tree byte-for-byte.
Milestone 4: Commit objects
A commit is text: tree <sha>\nparent <sha>\nauthor ...\ncommitter ...\n\n<message>.
import time

def commit(message):
    """Snapshot the working directory as a commit and advance refs/heads/main."""
    tree_sha = write_tree(".")
    head_sha = get_head_commit()  # None for the very first commit
    # Take the timestamp once so the author and committer lines always agree.
    stamp = f"Me <me@example.com> {int(time.time())} +0000"
    body = f"tree {tree_sha}\n"
    if head_sha:
        body += f"parent {head_sha}\n"
    body += f"author {stamp}\n"
    body += f"committer {stamp}\n\n"
    body += message + "\n"
    sha = hash_object(body.encode(), "commit")
    update_ref("refs/heads/main", sha)
    return sha
Evidence: mygit commit "first", then git --git-dir=.mygit log (real Git pointed at your repository directory) shows the commit. This demonstrates that the data model interoperates with real Git.
Milestone 5: Refs and HEAD
Resolve HEAD (which may be a symbolic ref) to a commit SHA. Update refs on commit.
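def get_head_commit():
    """Resolve HEAD to a commit SHA, or None if the branch has no commits yet."""
    with open(".mygit/HEAD") as f:
        content = f.read().strip()
    if content.startswith("ref: "):
        ref_path = content[5:]
        try:
            with open(f".mygit/{ref_path}") as f:
                return f.read().strip()
        except FileNotFoundError:
            return None  # the branch exists in name only
    return content  # detached HEAD: the file holds a SHA directly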
Evidence: Branch creation works. cat .mygit/refs/heads/main matches the latest commit SHA.
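Milestone 4 calls an update_ref helper that is never shown. One plausible shape, as a sketch: it assumes refs are plain text files under the repository directory, and the `repo_dir` parameter, `read_ref`, `resolve_head`, and `create_branch` are hypothetical names introduced here for illustration.

```python
import os


def update_ref(ref, sha, repo_dir=".mygit"):
    """Point ref (e.g. 'refs/heads/main') at sha by rewriting its file."""
    path = os.path.join(repo_dir, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Write a temporary file and rename it into place, so a crash
    # never leaves a half-written ref behind.
    tmp = path + ".lock"
    with open(tmp, "w") as f:
        f.write(sha + "\n")
    os.replace(tmp, path)


def read_ref(ref, repo_dir=".mygit"):
    """Return the SHA stored in ref, or None if the ref does not exist."""
    try:
        with open(os.path.join(repo_dir, ref)) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def resolve_head(repo_dir=".mygit"):
    """Follow HEAD to a commit SHA (None on a branch with no commits)."""
    content = read_ref("HEAD", repo_dir)
    if content is not None and content.startswith("ref: "):
        return read_ref(content[5:], repo_dir)
    return content  # detached HEAD, or None if HEAD is missing


def create_branch(name, repo_dir=".mygit"):
    """Create refs/heads/<name> pointing at the current HEAD commit."""
    sha = resolve_head(repo_dir)
    if sha is None:
        raise ValueError("cannot branch before the first commit")
    update_ref(f"refs/heads/{name}", sha, repo_dir)
    return sha
```

Creating a branch copies one 40-character hash into a new file; no objects are touched, which is why branching is cheap.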
Milestone 6: log
Walk parent commits.
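import re

def cmd_log(start=None):
    """Print commits from start (default: HEAD) back along first parents."""
    sha = start or get_head_commit()
    while sha:
        obj_type, data = cat_file(sha)
        print(f"commit {sha}")
        print(data.decode())
        print()
        # Search only the header block, so a message line that happens to
        # start with "parent <sha>" cannot be mistaken for a real parent.
        headers = data.split(b"\n\n", 1)[0]
        match = re.search(rb"^parent ([0-9a-f]{40})$", headers, re.MULTILINE)
        sha = match.group(1).decode() if match else None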
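Evidence: mygit log lists the same commits, in the same order and with the same headers, as git log --pretty=raw (which additionally indents the message by four spaces).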
Milestone 7: add and the index
The index (staging area) is a binary file .mygit/index listing tracked paths with their stat data + blob SHA. add updates the index. commit builds a tree from the index, not the working directory.
This is the conceptually trickiest part of Git. Take time.
Evidence: mygit add foo.txt → mygit status shows it staged. Modify it without re-adding → status shows it both staged (old) and modified (new).
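A deliberately simplified index can make the staging idea tangible before tackling Git's binary format. The sketch below stores the index as JSON, which real Git cannot read; the file name `index.json` and the functions `read_index`, `write_index`, `add`, and `blob_sha` are all hypothetical. It records only the blob hash and stat data; a full add would also write the blob into the object store.

```python
import hashlib
import json
import os


def blob_sha(data: bytes) -> str:
    """SHA-1 of data framed as a Git blob."""
    return hashlib.sha1(f"blob {len(data)}\0".encode() + data).hexdigest()


def read_index(index_path):
    """Load the index; an absent file means nothing is staged."""
    try:
        with open(index_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def write_index(index, index_path):
    with open(index_path, "w") as f:
        json.dump(index, f, indent=2, sort_keys=True)


def add(paths, index_path=".mygit/index.json"):
    """Stage each path: record the hash of its current contents plus stat data."""
    index = read_index(index_path)
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        st = os.stat(path)
        index[path] = {
            "sha": blob_sha(data),
            "size": st.st_size,
            "mtime_ns": st.st_mtime_ns,
        }
    write_index(index, index_path)
    return index
```

The stat fields exist so that status can skip re-hashing files whose size and modification time are unchanged; the hash remains the source of truth.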
Milestone 8: status and diff
Compare three things: HEAD, index, working directory. Each comparison produces one section of status output.
Changes to be committed: (HEAD vs index)
Changes not staged: (index vs working dir)
Untracked files: (in working dir, not in index)
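The three comparisons listed above reduce to dictionary and set operations once each side is flattened to a mapping of path to blob SHA. A minimal sketch, assuming that flattening has already been done elsewhere; the function name `status` and its return shape are illustrative choices.

```python
def status(head, index, workdir):
    """Compare three path -> blob-SHA mappings.

    head:    files in the HEAD commit's tree
    index:   files currently staged
    workdir: files on disk, hashed as blobs
    """
    # HEAD vs index: added, removed, or changed relative to the last commit.
    staged = sorted(
        p for p in set(head) | set(index) if head.get(p) != index.get(p)
    )
    # Index vs working directory: tracked files edited or deleted since staging.
    unstaged = sorted(p for p in index if workdir.get(p) != index[p])
    # On disk but unknown to the index.
    untracked = sorted(p for p in workdir if p not in index)
    return {"staged": staged, "unstaged": unstaged, "untracked": untracked}


example = status(
    head={"a.txt": "1", "b.txt": "2"},
    index={"a.txt": "1", "b.txt": "3", "c.txt": "4"},
    workdir={"a.txt": "9", "b.txt": "3", "c.txt": "4", "d.txt": "5"},
)
```

A file can appear in both the staged and unstaged lists at once, which is exactly the "staged (old) and modified (new)" case from Milestone 7.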
diff uses a longest-common-subsequence algorithm or Myers' diff. For simplicity, start with line-by-line output and add proper diff later.
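For the longest-common-subsequence option, a compact dynamic-programming sketch is shown here. It is the textbook quadratic LCS, not Myers' algorithm, and the name `lcs_diff` and the (tag, line) output shape are assumptions made for illustration.

```python
def lcs_diff(old, new):
    """Line diff via longest common subsequence.

    Returns a list of (tag, line) pairs, where tag is ' ' (kept),
    '-' (only in old) or '+' (only in new).
    """
    n, m = len(old), len(new)
    # lengths[i][j] is the LCS length of old[i:] and new[j:].
    lengths = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                lengths[i][j] = lengths[i + 1][j + 1] + 1
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])

    # Walk the table from the top-left, emitting one edit per step.
    out = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            out.append((" ", old[i]))
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            out.append(("-", old[i]))
            i += 1
        else:
            out.append(("+", new[j]))
            j += 1
    out.extend(("-", line) for line in old[i:])
    out.extend(("+", line) for line in new[j:])
    return out


result = lcs_diff(["a", "b", "c"], ["a", "x", "c"])
# result == [(" ", "a"), ("-", "b"), ("+", "x"), (" ", "c")]
```

The table costs time and memory proportional to the product of the two file lengths, which is fine for a tutorial and is the main thing Myers' algorithm improves on.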
Evidence: Edit a file, run status — output matches the conceptual table above.
Milestone 9: checkout
Given a commit SHA, recursively reconstitute its tree onto the working directory.
Evidence: mygit checkout <old-sha> — files revert.
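Checkout is the inverse of write_tree: parse the binary tree entries and write each blob back out. The sketch below is self-contained, so it carries its own object helpers; `objects_dir`, `dest`, and every function name are assumptions. It only writes files and does not delete files that are absent from the target commit, which a fuller checkout would need to handle.

```python
import hashlib
import os
import zlib


def write_object(objects_dir, obj_type, data):
    """Store an object (used here only to build test fixtures)."""
    full = f"{obj_type} {len(data)}\0".encode() + data
    sha = hashlib.sha1(full).hexdigest()
    path = os.path.join(objects_dir, sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(full))
    return sha


def read_object(objects_dir, sha):
    """Return (type, payload) for the object named sha."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        full = zlib.decompress(f.read())
    header, _, data = full.partition(b"\x00")
    return header.split(b" ")[0].decode(), data


def parse_tree(data):
    """Yield (mode, name, hex_sha) for each entry in raw tree bytes."""
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)
        null = data.index(b"\x00", space)
        mode = data[pos:space].decode()
        name = data[space + 1:null].decode()
        sha = data[null + 1:null + 21].hex()  # 20 raw bytes, not hex text
        pos = null + 21
        yield mode, name, sha


def checkout_tree(objects_dir, tree_sha, dest):
    """Recreate the tree under dest, recursing into sub-trees."""
    os.makedirs(dest, exist_ok=True)
    obj_type, data = read_object(objects_dir, tree_sha)
    if obj_type != "tree":
        raise ValueError(f"{tree_sha} is a {obj_type}, not a tree")
    for mode, name, sha in parse_tree(data):
        target = os.path.join(dest, name)
        if mode == "40000":
            checkout_tree(objects_dir, sha, target)
        else:
            _, blob = read_object(objects_dir, sha)
            with open(target, "wb") as f:
                f.write(blob)
            if mode == "100755":
                os.chmod(target, 0o755)


def checkout_commit(objects_dir, commit_sha, dest):
    """Read the commit's 'tree <sha>' header line and materialize that tree."""
    obj_type, data = read_object(objects_dir, commit_sha)
    if obj_type != "commit":
        raise ValueError(f"{commit_sha} is a {obj_type}, not a commit")
    tree_sha = data.split(b"\n", 1)[0].split(b" ")[1].decode()
    checkout_tree(objects_dir, tree_sha, dest)
```

A real checkout would also update HEAD and the index to match the commit it just restored.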
Milestone 10 (optional, ambitious): Push to GitHub
Implement enough of the smart HTTP protocol to push to GitHub. Ben Hoyt's pygit does this in ~150 lines.
This is where you confront: object packing, smart HTTP wire protocol, Git's negotiation algorithm.
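The first thing you meet on the wire is pkt-line framing: four hex digits giving the total length (the four digits included), followed by the payload, with 0000 acting as a flush marker. The sketch below covers only that framing, not negotiation or packing, and the function names are hypothetical.

```python
FLUSH = b"0000"          # flush packet: ends a section of the conversation
MAX_PKT_LEN = 65520      # largest total length allowed by the pkt-line format


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as one pkt-line."""
    length = len(payload) + 4  # the 4-digit prefix counts itself
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for a single pkt-line")
    return f"{length:04x}".encode() + payload


def parse_pkt_lines(stream: bytes):
    """Split a byte string into payloads; None marks a flush packet."""
    out = []
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4], 16)
        if length == 0:
            out.append(None)
            pos += 4
        elif length < 4:
            # Lengths 1 to 3 are reserved special packets; this sketch rejects them.
            raise ValueError(f"unsupported special packet {length:04x}")
        else:
            out.append(stream[pos + 4:pos + length])
            pos += length
    return out


message = pkt_line(b"a\n") + pkt_line(b"foobar\n") + FLUSH
```

Here `message` is `b"0006a\n000bfoobar\n0000"`, and `parse_pkt_lines(message)` returns the two payloads followed by `None`.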
8. Tests & evidence
| Test | How |
|---|---|
| Object hashing | SHA of a known string matches git hash-object |
| Tree byte format | Tree SHA matches real Git for the same directory |
| Commit interoperability | Real Git can read your commits and vice versa (this is the strongest evidence) |
| status correctness | Three-way comparison produces matching output for many edit scenarios |
| Round trip | add → commit → checkout → diff — the diff against the new working dir should be empty |
| History | A 10-commit history with branches walks correctly |
The strongest single evidence: your mygit and real git can read each other's repositories.
9. Common pitfalls
- Different newline / whitespace in commit objects. Git is strict. A single missing newline changes the SHA.
- Wrong byte order in tree entries. They're packed binary, not text. The 20-byte SHA is binary, not hex.
- Hashing the wrong thing. Git hashes <type> <len>\0<data>, not just <data>.
- Forgetting to compress. Objects are zlib-compressed on disk.
- Index format confusion. Git's index has a binary format with a strict version. For a tutorial, define your own simpler index. State the incompatibility.
- Symlinks and executable bit. Git distinguishes file modes 100644 (normal) and 100755 (executable). On Windows the executable bit is meaningless. State the simplification.
- Trying to implement merge before basic operations work. Merge is the hard part. Get init/add/commit/log/checkout solid first.
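For the file-mode pitfall, the mode string can be derived from the file's stat data instead of being hard-coded. A sketch under the assumption of a POSIX filesystem; `git_mode` is a hypothetical helper. Besides the two modes named above, Git uses 120000 for symbolic links and 40000 for directories.

```python
import os
import stat


def git_mode(path):
    """Return the tree-entry mode string Git would use for path."""
    st = os.lstat(path)  # lstat: look at the link itself, do not follow it
    if stat.S_ISLNK(st.st_mode):
        return "120000"  # the blob for a symlink holds the link target
    if stat.S_ISDIR(st.st_mode):
        return "40000"
    if st.st_mode & stat.S_IXUSR:
        return "100755"
    return "100644"
```

Git records only whether the owner-executable bit is set, so permissions such as 0600 or 0664 all collapse to 100644.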
10. Extensions
- Branches: already mostly there. Add branch <name> and checkout <branch>.
- Merge: three-way merge. Find common ancestor; compute three-way diff; write result.
- Diff: Myers' algorithm. Not too bad.
- Tags: annotated tags are a new object type; lightweight tags are just refs.
- Remotes: smart HTTP protocol. Hard but the canonical way to learn Git's internals deeply.
- Pack files: Git's optimization for storing many small objects efficiently. git gc rewrites a repository into pack files.
- Garbage collection: reachability analysis from refs.
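The reachability analysis behind garbage collection is a plain graph traversal. The sketch below works on an in-memory graph rather than real objects: `references` is an assumed mapping from each object's hash to the hashes it points at (a commit's tree and parents, a tree's entries, nothing for a blob), and the short hashes are made up.

```python
def reachable(roots, references):
    """Return every object reachable from roots (the hashes stored in refs)."""
    seen = set()
    stack = list(roots)
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(references.get(sha, ()))
    return seen


def garbage(roots, references):
    """Objects that no ref can reach, and so could be deleted."""
    return set(references) - reachable(roots, references)


# Two commits on main, plus an abandoned commit nothing points at.
references = {
    "c2": ["t2", "c1"],   # commit -> its tree and its parent
    "c1": ["t1"],
    "t2": ["b1", "b2"],   # tree -> its blobs
    "t1": ["b1"],
    "b1": [],
    "b2": [],
    "c_old": ["t_old"],   # unreachable once its branch was deleted
    "t_old": ["b1"],
}
unreferenced = garbage(["c2"], references)
```

Here `unreferenced` is `{"c_old", "t_old"}`; blob `b1` survives because a live tree still points at it.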
11. Module integration
| Module | What Git deepens |
|---|---|
| Sem 4 Module 1 — C/Python fundamentals | Solid project-sized program with multiple concerns. |
| Sem 5 Module 4 — File systems & I/O | Git is a small filesystem on top of a filesystem. |
| Blockchain tutorial | Both are content-addressable, hash-chained, append-only structures. |
| Database (KV) tutorial | Git's object store is a content-addressable KV. |
| Sem 7 architecture / DDD | Three-way comparison (HEAD/index/working) is a small but rich domain model. |
12. Portfolio framing
What to publish:
- Source organized as mygit/{init,object,tree,commit,index,refs,...}.py.
- A README that demonstrates the interoperability test: your mygit commit produces an object that real git log reads.
- Tests that hash known inputs and verify against git hash-object.
Reviewer entry points:
- mygit/object.py: hashing and storage.
- mygit/index.py: staging area (the most subtle code).
- mygit/commands/commit.py: the orchestration.
- README must include: list of supported commands, list of unsupported commands, the interoperability demo.
This is a genuinely impressive portfolio project. "I wrote enough of Git to commit itself to GitHub" reads well anywhere.
Source
This tutorial draws from the BYO-X catalog "Git" section. Pro Git, Chapter 10 is the canonical Git internals reference. wyag, ugit, and pygit are the three best modern Python walkthroughs.