Build Your Own LLM (GPT from Scratch)
"By the end of this lecture, you will have a 10-million-parameter language model that generates Shakespeare-like text. It will be small but real." -- Andrej Karpathy, "Let's build GPT"
Building a small GPT-style language model from scratch is the cleanest possible path from "I understand backprop" to "I understand modern AI." You build the tokenizer, the attention mechanism, the transformer block, the training loop. The result trains on a single GPU (or even a fast CPU) and writes recognizable English. 300 lines of Python.
This is the natural continuation of the Neural Network tutorial -- same audience, same tools, same author for the canonical primary path (Karpathy's "Zero to Hero" series).
1. Overview & motivation
A transformer-based language model has these components:
text -> [tokenizer] -> token ids
ids -> [embed] -> vectors per token
vec -> [transformer blocks x N]
-> ([self-attention] -> [feed-forward]) x N
out -> [unembedding] -> logits over vocabulary
logits -> [softmax] -> probability distribution
-> [sample] -> next token
You build every piece. By the end:
- You can train a 10M-100M parameter model that writes text recognizably in the style of your domain (Shakespeare, code, recipes, anything).
- You understand what every line of a real transformer does.
- You can read modern ML papers and recognize the constructions in code.
What you can only learn by building one:
- Why attention is a soft weighted lookup -- and why that's such a powerful primitive.
- Why causal masking is what makes the model autoregressive: each position predicts the next token from earlier tokens only, never from future ones.
- Why positional embeddings exist (self-attention on its own ignores token order, so order has to be injected).
- Why layer normalization stabilizes deep transformers (and why many recent models, LLaMA among them, moved from LayerNorm to RMSNorm).
- Why training a 100M-parameter model is not hard -- but training a 100B-parameter model is.
2. Where this fits in the degree
- Phase: Foundations
- Semester: 1 (Math Foundations) + Sem 2 (Algorithms)
- Modules deepened:
- Sem 1 Module 4 (linear algebra) -- attention is softmax(Q K^T / √d) V. Every line is matrix multiplication.
- Sem 1 Module 5 (probability / statistics) -- cross-entropy loss, sampling from a categorical distribution, temperature.
- Sem 2 Module 4 (DP) -- backprop through the entire transformer is one big DAG. Same backward-pass machinery as in the Neural Network tutorial.
Cross-phase relevance:
- Direct extension of the Neural Network tutorial. Use the autograd engine you built there.
- Connects to modern AI engineering, search relevance, code generation.
- The tokenizer connects to the Regex Engine tutorial (different parsing approach).
3. Prerequisites
- Complete the Neural Network tutorial first. This tutorial assumes you have a working autograd engine and can train a small MLP.
- Linear algebra: matrix multiplication, transpose, softmax. (Sem 1 Module 4.)
- Probability: cross-entropy, sampling. (Sem 1 Module 5.)
- Python: comfortable with NumPy or PyTorch tensors.
You do not need any prior NLP background. Karpathy and Raschka both build everything from scratch.
4. Theory & research
Required reading
- Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out" (YouTube video + nanoGPT repo + build-nanogpt repo) -- the canonical tutorial. ~2 hours of video. Walks line-by-line through a working transformer. Start here.
- Sebastian Raschka, Build a Large Language Model (From Scratch) -- Manning book + free companion GitHub + free 48-part YouTube live-coding series. Seven chapters: tokenization, attention, transformer, pretraining, fine-tuning. The deepest single resource.
Strongly recommended
- Vaswani et al., "Attention Is All You Need" (2017) -- the original transformer paper. arxiv:1706.03762. Read once after Karpathy. Short.
- Jay Alammar, "The Illustrated Transformer" (jalammar.github.io/illustrated-transformer/) -- the canonical visual explanation. Read alongside Karpathy.
- Karpathy's full "Neural Networks: Zero to Hero" series -- karpathy.ai/zero-to-hero.html. The full progression: micrograd -> makemore (bigrams) -> MLP -> backprop -> batch norm -> WaveNet -> GPT -> tokenizer.
Bonus depth
- Andrej Karpathy, "Let's build the GPT Tokenizer" (YouTube) -- companion video on byte-pair encoding (BPE). The tokenizer is the unsung hero of modern LLMs.
- Sebastian Raschka, "Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch" (sebastianraschka.com/blog/2025/bpe-from-scratch.html) -- focused on the tokenizer.
- Aman Arora's "The Annotated GPT-2" -- a clean small implementation to read alongside.
Theory (for deeper understanding)
- Goodfellow, Bengio, Courville, Deep Learning, Chapter 10 (sequence modeling). Free online.
- Stanford CS224N -- Natural Language Processing with Deep Learning -- free course recordings on YouTube.
5. Curated tutorial list (from BYO-X)
The BYO-X "AI Model" category lists:
- Python: A Large Language Model (LLM) -- primary entry; see Karpathy and Raschka resources above
- Python: Diffusion Models for Image Generation -- see related Hugging Face Diffusion Course
- Python: RAG for Document Search -- see resources below for extensions
Additional canonical references
- karpathy/nanoGPT (github.com/karpathy/nanoGPT) -- the production-ready version of what Karpathy builds in the video. ~600 lines.
- karpathy/build-nanogpt (github.com/karpathy/build-nanogpt) -- step-by-step commit history matching Karpathy's follow-up video "Let's reproduce GPT-2 (124M)".
- rasbt/LLMs-from-scratch (github.com/rasbt/LLMs-from-scratch) -- Raschka's complete code from his book.
- Hugging Face NLP Course (huggingface.co/learn/nlp-course) -- free, comprehensive. Goes deeper than this project but includes a transformer-from-scratch chapter.
RAG-specific (for the extension milestone)
- Hugging Face, "Code a simple RAG from scratch"
- learnbybuilding.ai, "A beginner's guide to building a Retrieval Augmented Generation (RAG) application from scratch"
6. Recommended primary path
Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out".
Two hours of video. Karpathy starts from a tiny Shakespeare dataset and a bigram model, then layers in:
- The bigram baseline.
- A simple averaging "context-window" model.
- Self-attention.
- Multi-head attention.
- Feed-forward layers.
- Layer normalization.
- Scaling up: dropout, residual connections, larger context.
By the end you have a working ~10M-parameter Shakespeare model. Roughly 300 lines of Python + PyTorch.
For a more guided book-format experience: Sebastian Raschka's Build a Large Language Model (From Scratch). Same destination, more explanation, comes with 48 free YouTube videos.
For this degree: Karpathy first (2 days), Raschka if you want depth (2-4 weeks).
If you've never trained any neural network: do the Neural Network tutorial first. This tutorial assumes that foundation.
7. Implementation milestones
Milestone 1: Character-level tokenizer + bigram model
Read a corpus (Karpathy uses tinyshakespeare.txt). Build a character-level vocabulary.
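import torch
import torch.nn as nn
import torch.nn.functional as F

# text: the whole corpus as one Python string (e.g. the contents of tinyshakespeare.txt)
chars = sorted(list(set(text)))  # every distinct character, in a stable order
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))  # first 90% for training, the rest for validation
train_data, val_data = data[:n], data[n:]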
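Training needs random (input, target) windows drawn from this split. The following is a minimal sketch of batch sampling, not the tutorial's exact code: the `block_size` and `batch_size` values are assumed for illustration, and `data` here is a stand-in tensor instead of the encoded corpus.

```python
import torch

torch.manual_seed(0)

# Stand-in for the encoded corpus; in the tutorial this would be train_data.
data = torch.arange(1000, dtype=torch.long)

block_size = 8   # context length (assumed value for illustration)
batch_size = 4   # sequences per batch (assumed value for illustration)

def get_batch(source, block_size, batch_size):
    """Sample random windows: x is the context, y is x shifted left by one."""
    # The largest valid start leaves room for block_size inputs plus one target.
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch(data, block_size, batch_size)
print(xb.shape, yb.shape)
```

Each row of `yb` is the matching row of `xb` shifted by one token, so a single batch holds `batch_size * block_size` next-token prediction examples.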
Build a bigram model: a vocab_size x vocab_size embedding table where row i predicts the distribution of next characters given current character i.
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss
Evidence: Sample from the trained bigram. Output should be character-level "noise that looks vaguely like English."
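Sampling from the model is a short loop: take the logits at the last position, turn them into probabilities, draw one token, append it, repeat. Here is a sketch under stated assumptions: `TinyBigram` is a hypothetical untrained stand-in for the model above, and the vocabulary size of 65 is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyBigram(nn.Module):
    """Hypothetical stand-in: row i holds the logits for the token after i."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)  # (B, T, vocab_size)

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    """Append max_new_tokens sampled ids to idx of shape (B, T)."""
    model.eval()
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :]                      # last position only
        probs = F.softmax(logits, dim=-1)                  # (B, vocab_size)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx

vocab_size = 65  # assumed vocabulary size for illustration
model = TinyBigram(vocab_size)
start = torch.zeros((1, 1), dtype=torch.long)  # begin from token id 0
out = generate(model, start, max_new_tokens=20)
print(out.shape)
```

With a trained model, passing `out[0].tolist()` through the tokenizer's `decode` yields the sample text.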
Milestone 2: Self-attention
The mathematical heart of the transformer.
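class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, head_size):
        super().__init__()
        # n_embd and block_size are module-level hyperparameters
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular matrix; a buffer, not a trainable parameter
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # scale by 1/sqrt(head_size): the d in softmax(Q K^T / sqrt(d)) V is the key dimension, not n_embd
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, head_size)
        return out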
The intuition: for each position, compute a learned query vector. Compute key vectors for that position and all earlier ones. Take the dot product of the query with each key, scale by 1/√d, and apply softmax so the weights sum to 1. Use those weights to form a weighted sum of value vectors.
Evidence: Re-train with attention. Validation loss drops. Sampled text starts looking more coherent.
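To make the "soft weighted lookup" concrete, it helps to print the attention weights for a tiny sequence. This sketch uses random queries and keys with assumed sizes (T=4, head_size=8); it is an illustration of the masking and softmax steps only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8                 # assumed sizes for illustration
q = torch.randn(T, head_size)       # one query per position
k = torch.randn(T, head_size)       # one key per position

scores = q @ k.T * head_size ** -0.5             # (T, T) scaled dot products
mask = torch.tril(torch.ones(T, T))              # 1 where attention is allowed
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)              # each row sums to 1

print(weights)
```

Row t holds the weights position t places on positions 0..t. Everything above the diagonal is exactly zero, and the first row is `[1, 0, 0, 0]` because position 0 can only attend to itself.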
Milestone 3: Multi-head attention + feed-forward + position embeddings
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)
Position embeddings: add a learned vector per position, so attention sees order.
Evidence: Validation loss drops further. Samples look like bad Shakespeare instead of random characters.
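The position-embedding step can be sketched as a second embedding table indexed by position instead of token id. The class name `Embeddings` and the sizes are hypothetical, chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 8, 16  # assumed sizes for illustration

class Embeddings(nn.Module):
    """Hypothetical helper: token embedding plus learned position embedding."""
    def __init__(self, vocab_size, block_size, n_embd):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)      # (B, T, n_embd)
        pos = torch.arange(T, device=idx.device)       # positions 0 .. T-1
        pos_emb = self.position_embedding_table(pos)   # (T, n_embd)
        return tok_emb + pos_emb                       # broadcast over the batch

emb = Embeddings(vocab_size, block_size, n_embd)
idx = torch.tensor([[5, 5, 5, 5]])   # the same token at four positions
x = emb(idx)
print(x.shape)
```

The same token id produces a different vector at each position, which is what lets attention tell "first" from "fourth". Note that positions start at 0, matching the off-by-one warning in the pitfalls section.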
Milestone 4: Transformer block (attention + MLP + residuals + LayerNorm)
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual
        x = x + self.ffwd(self.ln2(x))  # residual
        return x
The residual connections (x + ...) give gradients a direct path around each sub-layer, which is a large part of why a deep stack like this trains at all.
Evidence: Stack 6 blocks. Validation loss drops to ~1.5. Samples look like attempted Shakespeare.
Milestone 5: Train at scale (10M parameters)
Increase: block_size=256, n_embd=384, n_head=6, n_layer=6. Add dropout (p=0.2).
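A quick arithmetic check shows where the "10M" figure comes from. This sketch counts parameters for the layers defined in the milestones above, under assumptions the text does not spell out: a 65-character vocabulary (the tinyshakespeare character set), a final LayerNorm after the last block, and an output projection with bias. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size, block_size, n_embd, n_layer):
    """Parameter count for the GPT sketched in the milestones (assumptions in comments)."""
    # Attention: key/query/value without bias (all heads together), plus projection with bias.
    attn = 3 * n_embd * n_embd
    attn += n_embd * n_embd + n_embd
    # Feed-forward: n_embd -> 4*n_embd -> n_embd, both layers with bias.
    ffwd = n_embd * 4 * n_embd + 4 * n_embd
    ffwd += 4 * n_embd * n_embd + n_embd
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * 2 * n_embd
    per_block = attn + ffwd + norms

    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
    final_norm = 2 * n_embd                                 # assumed final LayerNorm
    lm_head = n_embd * vocab_size + vocab_size              # assumed output layer with bias
    return n_layer * per_block + embeddings + final_norm + lm_head

total = count_params(vocab_size=65, block_size=256, n_embd=384, n_layer=6)
print(f"{total:,} parameters")
```

Almost all of the total comes from the `12 * n_embd**2` term per block; the number of heads does not change the count because the heads split `n_embd` among themselves.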
Train for 5,000 steps on a GPU (10-20 minutes on a free Google Colab GPU).
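The training loop itself is the same few lines at every scale. Below is a minimal sketch, assuming AdamW as the optimizer and using a toy setup so it runs in seconds on a CPU: a synthetic, fully predictable corpus and a bigram table as a stand-in for the transformer. The sizes, learning rate, and step count are illustrative assumptions, not the settings for the 10M-parameter run.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, block_size, batch_size = 16, 8, 32  # toy values for illustration
# Synthetic corpus 0, 1, ..., 15, 0, 1, ... so the next token is fully predictable.
data = torch.arange(4000) % vocab_size

def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)  # bigram table as a stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-1)

losses = []
for step in range(200):
    xb, yb = get_batch()
    logits = model(xb)                                     # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)                  # clear old gradients
    loss.backward()                                        # backprop
    optimizer.step()                                       # parameter update
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

For the real run, swap in the transformer and the corpus batches, and periodically average the loss over several validation batches under `torch.no_grad()` so the curve is not dominated by single-batch noise.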
Evidence: Generated text:
DUKE VINCENTIO:
Why, sir, by some good prince in this seal'd
Wrong'd shoulder gave him not, his other...
Recognizably Shakespeare-flavored. Not coherent, but the style is there.
Milestone 6 (Karpathy lecture 7): BPE tokenizer
Replace the character-level tokenizer with byte-pair encoding (BPE). This is what GPT-2/3/4 use.
Karpathy has a dedicated lecture: "Let's build the GPT Tokenizer". Sebastian Raschka has a companion blog post.
The basic algorithm:
def get_stats(tokens):
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens
Evidence: Compress 1MB of text by ~3x while keeping decodability.
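The two helpers become a tokenizer once they are driven by a loop that repeatedly merges the most frequent pair. The following is a self-contained sketch of that loop on raw UTF-8 bytes; `train_bpe` and `decode` are hypothetical names, and the sample text and merge count are chosen for illustration.

```python
def get_stats(tokens):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    """Replace every occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Start from raw bytes (ids 0..255) and learn up to num_merges merges."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for n in range(num_merges):
        stats = get_stats(tokens)
        if not stats:                      # fewer than two tokens left
            break
        pair = max(stats, key=stats.get)   # most frequent adjacent pair
        new_id = 256 + n
        tokens = merge(tokens, pair, new_id)
        merges[pair] = new_id
    return tokens, merges

def decode(tokens, merges):
    """Rebuild the byte string for each id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict order is merge order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

text = "the cat sat on the mat. the cat ate the rat."
tokens, merges = train_bpe(text, num_merges=10)
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Encoding new text with a trained tokenizer applies the learned merges in the same order they were learned; that step is omitted here to keep the sketch short.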
Milestone 7 (optional, ambitious): Pretrain a real model
With nanoGPT and a multi-GPU budget, you can train a model on openwebtext (~8GB) at GPT-2 scale (~124M parameters). Result: a model that writes coherent English (not just Shakespeare style).
This requires real GPUs. Lambda Labs / RunPod / Vast.ai rent A100s for roughly $1/hour each. The nanoGPT README reports about 4 days on an 8xA100 node for its GPT-2-small reproduction, so expect a single A100 to take several times longer.
Milestone 8 (optional, extension): RAG
Add a retrieval step: before generating, look up relevant context from a document store (embeddings + cosine similarity). Concatenate retrieved chunks into the prompt.
This connects to the Search Engine tutorial -- RAG is search + LLM, here with embedding similarity in place of keyword matching.
Evidence: A small Q&A demo: load your own PDF or notes; ask questions; model answers with grounded references.
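The retrieve-then-prompt flow can be sketched in a few lines. Everything here is illustrative: `embed` is a toy bag-of-words stand-in for a learned embedding model, and `retrieve`, `build_prompt`, and the prompt template are hypothetical names and choices, not part of any library.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate retrieved chunks ahead of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "attention computes a weighted sum of value vectors",
    "byte pair encoding merges the most frequent pair of tokens",
    "residual connections add the input of a block to its output",
]
prompt = build_prompt("how does byte pair encoding choose tokens", chunks, k=1)
print(prompt)
```

The prompt string is what gets tokenized and fed to the language model. Swapping `embed` for dense vectors from a real embedding model changes the quality of retrieval, not the structure of the pipeline.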
8. Tests & evidence
| Test | How |
|---|---|
| Tokenizer round-trip | decode(encode(text)) == text for every test string |
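| Bigram baseline | Validation loss should settle near ~2.5 nats per character on tinyshakespeare, down from about ln(65) ≈ 4.17 at initialization for its 65-character vocabulary |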
| Attention output shape | (B, T, head_size) -- assert in tests |
| Causal mask | Position t cannot attend to positions > t. Test by changing the inputs at positions > t and confirming the output at positions <= t is unchanged |
| Validation loss curves | Loss drops at each milestone; plot is monotonic-ish |
| Sample quality | Manual grading. After milestone 5, samples should be recognizably Shakespeare-flavored |
| Compare against nanoGPT | At equal hyperparameters, your model and nanoGPT should match within tolerance |
The strongest evidence: a sample paragraph of generated text alongside a baseline (bigram). The improvement should be obvious.
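The causal-mask row in the table above translates directly into a test: perturb the inputs after position t and check that outputs up to t do not move. This sketch is self-contained, with assumed sizes and a `Head` that takes its hyperparameters as arguments so the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_embd, head_size, block_size = 16, 8, 6  # assumed sizes for illustration

class Head(nn.Module):
    """One head of causal self-attention (hyperparameters passed explicitly)."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v

head = Head(n_embd, head_size, block_size).eval()
x = torch.randn(1, block_size, n_embd)

t = 2
x_changed = x.clone()
x_changed[:, t + 1:, :] = torch.randn(1, block_size - t - 1, n_embd)  # alter the future

with torch.no_grad():
    out_a = head(x)
    out_b = head(x_changed)

# Outputs at positions <= t must not depend on anything after t.
past_unchanged = torch.allclose(out_a[:, :t + 1], out_b[:, :t + 1], atol=1e-6)
future_changed = not torch.allclose(out_a[:, t + 1:], out_b[:, t + 1:])
print(past_unchanged, future_changed)
```

Removing the `masked_fill` line should make `past_unchanged` false, which is a useful sanity check that the test can actually fail.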
9. Common pitfalls
- Forgetting the scaling factor in attention (/ √d). Without it, softmax saturates and gradients vanish.
- Wrong dimension order in q @ k.transpose(-2, -1). Easy to swap. Result: nonsensical attention patterns.
- Forgetting the causal mask. Without it, you're training a "see the future" model -- looks great on train, fails on inference.
- Position embeddings off-by-one. A token at position 0 should see position 0; an off-by-one breaks everything.
- Mixing batch and time dimensions in cross-entropy. Reshape carefully: (B*T, C) and (B*T,).
- LayerNorm before or after attention? Pre-norm (x + sa(ln(x))) is the modern default. Post-norm (ln(x + sa(x))) was original; harder to train.
- Dropout on inference. Don't. Call model.eval().
- Forgetting .eval() + torch.no_grad() during sampling. Without no_grad(), autograd stores activations you never use and wastes memory; without eval(), dropout stays active and adds noise on top of the sampling you intended.
- Tokenization mismatch between training and inference. Use the same encoder both times.
- Training "loss looks great" but samples look bad. Likely cause: data leakage from val into train, or wrong masking.
10. Extensions
- BPE tokenizer. Milestone 6. Mentioned above.
- Larger context window. Karpathy uses 256. GPT-3 uses 2048. Modern models 128k+.
- Flash attention. Memory-efficient attention. Tri Dao's algorithm.
- Rotary positional embeddings (RoPE). Modern replacement for learned position embeddings.
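- Mixture of Experts (MoE). Sparse routing so each token activates only a subset of the parameters. Used by open models such as Mixtral and widely reported, though not officially confirmed, for GPT-4.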
- Fine-tuning. Take a pretrained model and fine-tune on a specific task (instruction-following, code, etc.).
- RLHF / DPO. Reinforcement learning from human feedback, and direct preference optimization as a simpler alternative. RLHF is the technique that turned GPT-3 into ChatGPT.
- Quantization -- 8-bit or 4-bit. Halves or quarters weight memory relative to 16-bit weights, usually at minor quality loss.
- Distillation -- train a small model to imitate a large one.
- Vision transformer. Same architecture, replace tokens with image patches.
- RAG. Milestone 8. Connects to search.
11. Module integration
| Module | What the LLM deepens |
|---|---|
| Sem 1 Module 4 -- Linear algebra | Attention is softmax(Q K^T / √d) V. Every line is matmul. Internalizes shape arithmetic. |
| Sem 1 Module 5 -- Statistics / probability | Cross-entropy, sampling, temperature, top-k, nucleus sampling. |
| Sem 2 Module 4 -- Dynamic programming | Backprop through the entire transformer is one DAG. Same machinery as Neural Network tutorial. |
| Neural Network tutorial | Direct prerequisite -- the autograd engine you built is what powers this. |
| 3D Renderer tutorial | The two big Sem 1 math-heavy projects. Linear algebra + probability in different domains. |
| Search Engine tutorial | RAG is search + LLM. The combination is the dominant pattern in modern AI applications. |
| Regex Engine tutorial | Different parsing approaches -- finite automata vs subword tokenization. Both turn text into something processable. |
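The sampling controls named in the probability row above (temperature and top-k) are small transformations applied to the logits before drawing a token. Here is a sketch with a hypothetical `sample_next` helper; nucleus (top-p) sampling is left out for brevity.

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None):
    """Sample one token id per row from logits of shape (B, vocab_size)."""
    logits = logits / temperature          # < 1 sharpens, > 1 flattens the distribution
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        kth = torch.topk(logits, k, dim=-1).values[:, [-1]]  # k-th largest per row
        logits = logits.masked_fill(logits < kth, float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)           # (B, 1)

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])

# With top_k=2 only the two highest-scoring ids (0 and 1) can ever be drawn.
ids = torch.cat([sample_next(logits, temperature=0.8, top_k=2) for _ in range(200)])
greedy = sample_next(logits, top_k=1)
print(sorted(set(ids.flatten().tolist())), greedy.item())
```

Setting `top_k=1` reduces sampling to greedy decoding, which is one common cause of the degenerate repetition seen in small models.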
12. Portfolio framing
What to publish:
- Source organized as tokenizer/, model/, train/, sample/.
- A training curves plot: loss over training steps, baseline vs final.
- A sample paragraph of generated text. Pick the most coherent one from 5-10 samples.
- A README with:
- Model size (parameters).
- Training data (Shakespeare, Hacker News, your blog, etc.).
- Training hardware and wall time.
- Sample outputs.
- Honest assessment of capabilities and limitations.
What to keep private:
- Training data with private content (your own writing, anything copyrighted).
- API keys for any inference services you used for comparison.
Reviewer entry points:
- model/transformer.py -- the architecture.
- model/attention.py -- the attention mechanism.
- train/loop.py -- the training loop.
- README must include: training curves plot, sample paragraph, acknowledgement of Karpathy/Raschka as primary references.
A working GPT-from-scratch is a flagship portfolio piece. "I trained a 10M-parameter transformer that writes Shakespeare-flavored text from a corpus I prepared myself" is concrete, verifiable, and demonstrates depth beyond using a pre-trained API.
Honesty disclaimer
A 10M-parameter model from this tutorial is not ChatGPT. It is the architecture of ChatGPT, trained at 0.01% the scale. The right framing in your portfolio:
"I implemented a complete GPT-style transformer from scratch, trained on a small corpus. The architecture mirrors GPT-2; the scale is much smaller. The point is depth of understanding, not competitive model quality."
This honesty strengthens the portfolio because it shows technical maturity. Overclaiming weakens it.
13. Local source backbone
Use the local LLM chunks as a chapter map for a fuller semester pass:
- Building LLMs From Scratch (build-your-own/building-llms-from-scratch)
- 2024 Build LLMs (build-your-own/llms-2024)
- Build a Large Language Model From Scratch (build-your-own/large-language-model-raschka)
These sources should expand the project into a reproducible lab notebook, not replace Karpathy's minimal build.
| Local chunks | Use them for | Add to this project |
|---|---|---|
| building-llms-from-scratch-contents/002-008 | Big-picture LLM workflow, data setup, and tokenization foundations | Add a tokenizer design note comparing character, word, BPE, and GPT-style tokenization. |
| 009-018 | Embeddings, attention, causal masks, and transformer blocks | Add shape tables for every tensor in the forward pass. |
| 019-023 | Training loop, optimizer, evaluation, and sample generation | Add a reproducibility packet: seed, device, batch size, context length, train/val loss. |
| 024-032 | Fine-tuning, instruction tuning, and practical next steps | Add an extension path: domain continuation pretraining, then instruction tuning on a tiny curated set. |
| 2024-build-llms-contents/001-003 | Architecture summaries and technical slide responses | Use as review prompts after the transformer is working. |
| Raschka chunks | Longer-form implementation detail across tokenizer, GPT model, pretraining, and fine-tuning | Use as the deep reading path for learners who want book-length scaffolding. |
Extra checkpoints from the book chunks
- Tokenizer audit: train or implement a small tokenizer and show how it segments code, prose, numbers, and rare words.
- Attention audit: print the causal mask and one attention matrix for a tiny batch; explain which tokens can attend to which prior tokens.
- Scaling audit: run the same code at three model sizes and report loss, tokens/sec, memory use, and sample quality.
- Fine-tuning audit: compare base-model samples and fine-tuned samples on the same prompts, then document failure modes.
14. Deep project spec
Project contract
Build a small GPT-style language model with a reproducible training packet. The minimum contract is tokenizer, dataset split, embeddings, causal self-attention, multi-head attention, feed-forward block, residual/LayerNorm stack, training loop, sampling, evaluation loss, and an honesty note about scale. RAG and fine-tuning are extensions.
Source-backed reading map
| Source ID | Use for | Required output |
|---|---|---|
| build-your-own/building-llms-from-scratch | tokenizer, embeddings, attention, transformer block, training, fine-tuning | tokenizer audit, tensor-shape tables, training packet |
| build-your-own/llms-2024 | architecture review and implementation prompts | review questions and design recap |
| build-your-own/large-language-model-raschka | book-length GPT implementation detail | deep reading path and optional fine-tuning checkpoints |
Milestone map
| Milestone | Deliverable | Tests | Failure case |
|---|---|---|---|
| Dataset/tokenizer | train/val split and tokenizer | encode/decode round trip | unknown/rare token behavior |
| Bigram baseline | simplest model and loss | baseline loss fixture | leakage between train/val |
| Attention | causal mask and attention weights | shape/mask tests | token attends to future |
| Transformer block | MHA, MLP, residual, LayerNorm | tensor-shape snapshots | unstable loss or NaNs |
| Training loop | optimizer, checkpoints, metrics | fixed-seed run | non-reproducible run |
| Sampling | temperature/top-k if included | prompt-output transcript | degenerate repetition |
| Scaling/fine-tuning extension | three model sizes or tiny instruction set | comparison report | overclaiming model ability |
Test matrix
| Test type | Required examples |
|---|---|
| Unit | tokenizer round trip, mask shape, logits shape |
| Numerical | attention probabilities sum correctly; no future-token leakage |
| Golden | tiny-batch forward pass shape table |
| Experiment | fixed config, seed, loss curve, samples at checkpoints |
| Benchmark | tokens/sec and memory use for at least two model sizes |
| Evaluation | base vs fine-tuned or baseline vs transformer comparison |
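The "fixed config, seed" row in the matrix above is easiest to satisfy if the whole run is described by one serializable object. This is a sketch of such a packet: `RunConfig` and `seed_everything` are hypothetical names, the model sizes come from Milestone 5, and the seed, batch size, and learning rate are assumed placeholder values.

```python
import json
import random
from dataclasses import dataclass, asdict

import torch

@dataclass
class RunConfig:
    """Hypothetical config object; values marked 'assumed' are placeholders."""
    seed: int = 1337              # assumed
    block_size: int = 256
    batch_size: int = 64          # assumed
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    dropout: float = 0.2
    learning_rate: float = 3e-4   # assumed
    max_steps: int = 5000

def seed_everything(seed):
    """Seed Python's and PyTorch's random number generators."""
    random.seed(seed)
    torch.manual_seed(seed)

cfg = RunConfig()

# Re-seeding reproduces the same random draws.
seed_everything(cfg.seed)
a = torch.rand(3)
seed_everything(cfg.seed)
b = torch.rand(3)

packet = {
    "config": asdict(cfg),
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "torch_version": torch.__version__,
}
print(json.dumps(packet, indent=2))
```

Saving this JSON next to each checkpoint, together with the train/val loss at that step, gives a reviewer what they need to rerun the experiment. Bit-for-bit reproducibility on a GPU may require additional determinism settings beyond seeding.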
Design notes required
- tokenizer.md: tokenizer choice, examples, compression/coverage tradeoffs.
- architecture.md: tensor shapes for every major operation.
- training.md: corpus, split, seed, batch size, context length, optimizer, hardware.
- limitations.md: scale, hallucination, data quality, and why this is not a production assistant.
Portfolio evidence
Publish the training config, loss curve, sample generations at multiple checkpoints, attention/mask visualization for a tiny batch, tokenizer audit, and explicit scale/limitation disclaimer.
Source
This tutorial draws from the BYO-X catalog "AI Model" section ("A Large Language Model"). Andrej Karpathy's "Let's build GPT" lecture, "Neural Networks: Zero to Hero" series, and Sebastian Raschka's Build a Large Language Model (From Scratch) are the canonical primary references.