
Build Your Own LLM (GPT from Scratch)

"By the end of this lecture, you will have a 10-million-parameter language model that generates Shakespeare-like text. It will be small but real." -- Andrej Karpathy, "Let's build GPT"

Building a small GPT-style language model from scratch is one of the cleanest paths from "I understand backprop" to "I understand modern AI." You build the tokenizer, the attention mechanism, the transformer block, and the training loop. The result trains on a single GPU (or even a fast CPU) and writes recognizable English, all in roughly 300 lines of Python.

This is the natural continuation of the Neural Network tutorial -- same audience, same tools, same author for the canonical primary path (Karpathy's "Zero to Hero" series).


1. Overview & motivation

A transformer-based language model has these components:

text  -> [tokenizer]   -> token ids
ids -> [embed] -> vectors per token
vec -> [transformer blocks x N]
-> ([self-attention] -> [feed-forward]) x N
out -> [unembedding] -> logits over vocabulary
logits -> [softmax] -> probability distribution
-> [sample] -> next token
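
The last two stages of the pipeline above (logits to probabilities to a sampled token) can be sketched in plain Python. This is a minimal illustration, not the tutorial's PyTorch code: the logits are made-up numbers for a hypothetical 4-token vocabulary, and the function names are chosen here for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; subtracting the max keeps exp() stable."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token id from the categorical distribution defined by the logits."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits over a 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)
sharper = softmax(logits, temperature=0.5)  # lower temperature concentrates mass on the top token
next_id = sample_next(logits, temperature=0.8, rng=random.Random(0))
```

Lower temperatures make sampling more conservative; higher temperatures flatten the distribution and make samples more varied.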

You build every piece. By the end:

  • You can train a 10M-100M parameter model that writes coherent text in your domain (Shakespeare, code, recipes, anything).
  • You understand what every line of a real transformer does.
  • You can read modern ML papers and recognize the constructions in code.

What you can only learn by building one:

  • Why attention is a soft weighted lookup -- and why that's such a powerful primitive.
  • Why causal masking is what makes a language model "language" (predict next, not all).
  • Why positional embeddings exist (self-attention alone has no notion of order: permuting the input tokens just permutes the outputs, so order has to be injected).
  • Why layer normalization stabilizes deep transformers (and why many recent models moved from LayerNorm to RMSNorm).
  • Why training a 100M-parameter model is not hard -- but training a 100B-parameter model is.

2. Where this fits in the degree

  • Phase: Foundations
  • Semester: 1 (Math Foundations) + Sem 2 (Algorithms)
  • Modules deepened:
    • Sem 1 Module 4 (linear algebra) -- attention is softmax(Q K^T / √d) V. Every line is matrix multiplication.
    • Sem 1 Module 5 (probability / statistics) -- cross-entropy loss, sampling from a categorical distribution, temperature.
    • Sem 2 Module 4 (DP) -- backprop through the entire transformer is one big DAG. Same backward-pass machinery as in the Neural Network tutorial.

Cross-phase relevance:

  • Direct extension of the Neural Network tutorial. Use the autograd engine you built there.
  • Connects to modern AI engineering, search relevance, code generation.
  • The tokenizer connects to the Regex Engine tutorial (different parsing approach).

3. Prerequisites

  • Complete the Neural Network tutorial first. This tutorial assumes you have a working autograd engine and can train a small MLP.
  • Linear algebra: matrix multiplication, transpose, softmax. (Sem 1 Module 4.)
  • Probability: cross-entropy, sampling. (Sem 1 Module 5.)
  • Python: comfortable with NumPy or PyTorch tensors.

You do not need any prior NLP background. Karpathy and Raschka both build everything from scratch.


4. Theory & research

Required reading

  • Vaswani et al., "Attention Is All You Need" (2017) -- the original transformer paper. arxiv:1706.03762. Read once after Karpathy. Short.
  • Jay Alammar, "The Illustrated Transformer" (jalammar.github.io/illustrated-transformer/) -- the canonical visual explanation. Read alongside Karpathy.
  • Karpathy's full "Neural Networks: Zero to Hero" series -- karpathy.ai/zero-to-hero.html. The full progression: micrograd -> makemore (bigrams) -> MLP -> batch norm -> backprop -> WaveNet -> GPT -> tokenizer.

Bonus depth

  • Andrej Karpathy, "Let's build the GPT Tokenizer" (YouTube) -- companion video on byte-pair encoding (BPE). The tokenizer is the unsung hero of modern LLMs.
  • Sebastian Raschka, "Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch" (sebastianraschka.com/blog/2025/bpe-from-scratch.html) -- focused on the tokenizer.
  • Phil Wang's "Annotated GPT-2" -- a clean small implementation to read alongside.

Theory (for deeper understanding)

  • Goodfellow, Bengio, Courville, Deep Learning, Chapter 10 (sequence modeling). Free online.
  • Stanford CS224N -- Natural Language Processing with Deep Learning -- free course recordings on YouTube.

5. Curated tutorial list (from BYO-X)

The BYO-X "AI Model" category lists:

  • Python: A Large Language Model (LLM) -- primary entry; see Karpathy and Raschka resources above
  • Python: Diffusion Models for Image Generation -- see related Hugging Face Diffusion Course
  • Python: RAG for Document Search -- see resources below for extensions

Additional canonical references

RAG-specific (for the extension milestone)


Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out".

Two hours of video. Karpathy starts from a tiny Shakespeare dataset and a bigram model, then layers in:

  1. The bigram baseline.
  2. A simple averaging "context-window" model.
  3. Self-attention.
  4. Multi-head attention.
  5. Feed-forward layers.
  6. Layer normalization.
  7. Scaling up: dropout, residual connections, larger context.

By the end you have a working ~10M-parameter Shakespeare model. Roughly 300 lines of Python + PyTorch.

For a more guided book-format experience: Sebastian Raschka's Build a Large Language Model (From Scratch). Same destination, more explanation, comes with 48 free YouTube videos.

For this degree: Karpathy first (2 days), Raschka if you want depth (2-4 weeks).

If you've never trained any neural network: do the Neural Network tutorial first. This tutorial assumes that foundation.


7. Implementation milestones

Milestone 1: Character-level tokenizer + bigram model

Read a corpus (Karpathy uses tinyshakespeare.txt). Build a character-level vocabulary.

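import torch

# Load the corpus (file name as used above; adjust the path to your copy).
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Character-level vocabulary: every distinct character is one token.
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> string
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Encode the whole corpus and hold out the last 10% for validation.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]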
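
Training needs (input, target) pairs in which the target is the input shifted by one character. The following is a minimal sketch of that batching step in plain Python lists. The helper name `get_batch`, the stand-in corpus, and the sizes are assumptions for illustration; the real project would do the same thing with tensors.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Return (x, y) where each row of y is the matching row of x shifted right by one."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[s : s + block_size] for s in starts]
    y = [data[s + 1 : s + block_size + 1] for s in starts]
    return x, y

text = "to be, or not to be"  # stand-in corpus, not the real dataset
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = [stoi[c] for c in text]

xb, yb = get_batch(data, block_size=8, batch_size=4, rng=random.Random(1337))
```

Each row therefore packs `block_size` training examples: the target at position t is the next character after the context `x[:t+1]`.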

Build a bigram model: a vocab_size x vocab_size embedding table where row i holds the logits for the distribution of next characters given current character i.

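import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the logits for the character that follows character i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) targets, so flatten batch and time.
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss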

Evidence: Sample from the trained bigram. Output should be character-level "noise that looks vaguely like English."
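
Two numbers are useful sanity checks at this milestone. An untrained model that guesses uniformly should start near ln(vocab_size), and a bigram model should land well below that. The sketch below estimates both with simple counting instead of gradient descent. It assumes add-one smoothing, a stand-in corpus, and evaluation on the same text used for counting, so the numbers are illustrative only. The vocabulary size of 65 for tinyshakespeare is also an assumption not stated above.

```python
import math
from collections import Counter, defaultdict

def bigram_nll(text, smoothing=1.0):
    """Average negative log-likelihood (nats/char) of a count-based bigram model,
    plus the uniform-guess baseline ln(V). Evaluated on the counting text itself."""
    vocab = sorted(set(text))
    V = len(vocab)
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = counts[a]
        p = (row[b] + smoothing) / (sum(row.values()) + smoothing * V)
        total += -math.log(p)
    return total / (len(text) - 1), math.log(V)

corpus = "the quick brown fox jumps over the lazy dog " * 20  # stand-in corpus
nll, uniform = bigram_nll(corpus)

# Expected starting loss for a 65-character vocabulary (assumed size).
initial_loss_65 = math.log(65)  # about 4.17 nats
```

If the first training loss is far from ln(vocab_size), the initialization or the loss reshaping is a likely suspect.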

Milestone 2: Self-attention

The mathematical heart of the transformer.

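class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, head_size):
        super().__init__()
        # n_embd and block_size are module-level hyperparameters.
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular matrix used as the causal mask (a buffer, not a parameter).
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        # Scale by 1/sqrt(head_size), the dimension of q and k (not by C = n_embd).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v      # (B, T, head_size)
        return out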

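The intuition: each position emits a learned query vector, and every position also emits a key and a value vector. Position t takes the dot product of its query with the keys at positions 0..t (itself included; the causal mask hides later positions), scales by 1/√d, and applies softmax. The resulting weights form a weighted sum of the value vectors.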
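
The same computation can be traced by hand on a tiny example. This sketch uses plain Python lists and made-up 2-dimensional vectors for three positions. It skips the learned projections and takes q, k, v as given, so it shows only the masking, scaling, softmax, and weighted sum.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors. Returns (outputs, weights) for causal attention."""
    T, d = len(q), len(q[0])
    outputs, weights = [], []
    for t in range(T):
        # Scores against positions 0..t only: this loop bound is the causal mask.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out = [sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))]
        outputs.append(out)
        weights.append(w + [0.0] * (T - t - 1))  # pad future positions with zero weight
    return outputs, weights

# Made-up vectors for three positions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, wei = causal_attention(q, k, v)

# Changing a future value must not change earlier outputs.
v_changed = [v[0], v[1], [100.0, -100.0]]
out_changed, _ = causal_attention(q, k, v_changed)
```

Position 0 can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]` and its output equals `v[0]`.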

Evidence: Re-train with attention. Validation loss drops. Sampled text starts looking more coherent.

Milestone 3: Multi-head attention + feed-forward + position embeddings

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

Position embeddings: add a learned vector per position, so attention sees order.
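
As a rough sketch of what "add a learned vector per position" means, the snippet below builds two lookup tables and sums them. The tables hold random numbers standing in for learned parameters, and the sizes and the helper name `embed` are illustrative assumptions.

```python
import random

rng = random.Random(0)
vocab_size, block_size, n_embd = 5, 4, 3  # tiny illustrative sizes

# Random stand-ins for the two learned tables.
tok_table = [[rng.gauss(0, 1) for _ in range(n_embd)] for _ in range(vocab_size)]
pos_table = [[rng.gauss(0, 1) for _ in range(n_embd)] for _ in range(block_size)]

def embed(idx):
    """Token vector plus position vector for each position in the sequence."""
    return [[t + p for t, p in zip(tok_table[tok], pos_table[pos])]
            for pos, tok in enumerate(idx)]

x = embed([2, 2, 0, 2])  # token 2 appears at positions 0, 1 and 3
```

The same token id now maps to a different vector at each position, which is what lets attention distinguish "the first 2" from "the second 2".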

Evidence: Validation loss drops further. Samples look like bad Shakespeare instead of random characters.

Milestone 4: Transformer block (attention + MLP + residuals + LayerNorm)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual
        x = x + self.ffwd(self.ln2(x))  # residual
        return x

The residual connections (x + ...) are among the most important architectural features for training deep networks: they give gradients a direct path from the loss back to every layer.
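
To make the pre-norm residual pattern concrete, here is a minimal sketch on a single vector in plain Python. It assumes a LayerNorm without the learnable scale and shift that nn.LayerNorm has, and `sublayer` is a placeholder for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (nearly) unit variance.
    Learnable scale/shift parameters are omitted in this sketch."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def pre_norm_residual(x, sublayer):
    """x + sublayer(layer_norm(x)): the pattern used twice in each block."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)

# If the sublayer outputs zeros, the block is exactly the identity.
identity_out = pre_norm_residual(x, lambda h: [0.0] * len(h))
```

The identity case is the point: a block can start out close to "do nothing" and learn a correction on top, which is much easier to optimize than learning the whole mapping.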

Evidence: Stack 6 blocks. Validation loss drops to ~1.5. Samples look like attempted Shakespeare.

Milestone 5: Train at scale (10M parameters)

Increase: block_size=256, n_embd=384, n_head=6, n_layer=6. Add dropout (p=0.2).

Train for 5,000 steps on a GPU. Karpathy reports about 15 minutes on an A100; a free Google Colab GPU is slower, so expect a longer run.
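
The "10M parameters" figure can be checked with arithmetic from the hyperparameters above. The sketch below counts parameters for the modules defined in earlier milestones. It additionally assumes a vocabulary of 65 characters, a final LayerNorm, and an untied output projection with bias; none of these are stated above, so treat them as assumptions.

```python
def count_params(vocab_size, block_size, n_embd, n_head, n_layer):
    """Parameter count for the architecture sketched in this tutorial (see assumptions)."""
    assert n_embd % n_head == 0, "head_size = n_embd // n_head must divide evenly"
    attn = 3 * n_embd * n_embd                 # key/query/value over all heads, no bias
    proj = n_embd * n_embd + n_embd            # output projection with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * (2 * n_embd)                   # two LayerNorms, each with weight and bias
    block = attn + proj + ffwd + norms
    embeddings = vocab_size * n_embd + block_size * n_embd
    final_norm = 2 * n_embd                    # assumed final LayerNorm
    lm_head = n_embd * vocab_size + vocab_size # assumed untied linear head with bias
    return n_layer * block + embeddings + final_norm + lm_head

total = count_params(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
per_block_estimate = 12 * 384 * 384            # the usual 12 * n_embd^2 rule of thumb
```

Under these assumptions the total is 10,788,929. Almost all of it sits in the six blocks, which is why 12 * n_layer * n_embd^2 is a good rule of thumb.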

Evidence: Generated text:

DUKE VINCENTIO:
Why, sir, by some good prince in this seal'd
Wrong'd shoulder gave him not, his other...

Recognizably Shakespeare-flavored. Not coherent, but the style is there.

Milestone 6 (Karpathy lecture 8): BPE tokenizer

Replace the character-level tokenizer with byte-pair encoding (BPE). This is what GPT-2/3/4 use.

Karpathy has a dedicated lecture: "Let's build the GPT Tokenizer". Sebastian Raschka has a companion blog post.

The basic algorithm:

def get_stats(tokens):
    # Count how often each adjacent pair of tokens occurs.
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`, left to right.
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens

Evidence: Compress 1MB of text by ~3x while keeping decodability.
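
The two helpers above become a tokenizer once they are called in a loop. The following is a minimal sketch of that loop, assuming byte-level input (ids 0..255) and new ids assigned from 256 upward. The names `train_bpe` and `decode_bpe` are chosen here for illustration.

```python
def get_stats(tokens):
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    new_tokens, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens

def train_bpe(text, num_merges):
    """Repeatedly merge the most frequent adjacent pair into a new token id."""
    tokens = list(text.encode("utf-8"))
    merges = {}  # (left, right) -> new_id, kept in merge order
    for step in range(num_merges):
        stats = get_stats(tokens)
        if not stats:
            break
        pair = max(stats, key=stats.get)
        new_id = 256 + step
        tokens = merge(tokens, pair, new_id)
        merges[pair] = new_id
    return tokens, merges

def decode_bpe(tokens, merges):
    """Expand merged ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dicts preserve insertion (merge) order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

text = "aaabdaaabac"
tokens, merges = train_bpe(text, num_merges=3)
```

On this toy string, three merges shrink 11 bytes to 5 tokens, and `decode_bpe(tokens, merges)` returns the original text.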

Milestone 7 (optional, ambitious): Pretrain a real model

With nanoGPT you can train a model on OpenWebText (~8M documents) at GPT-2 scale (~124M parameters). Result: a model that writes coherent English (not just Shakespeare style).

This requires real GPUs. Lambda Labs / RunPod / Vast.ai rent A100s for ~$1/hour. The nanoGPT README reports about 4 days on a node of 8 A100 40GB GPUs for a full GPT-2-small reproduction, so budget for multi-GPU days, not a few GPU hours.

Milestone 8 (optional, extension): RAG

Add a retrieval step: before generating, look up relevant context from a document store (embeddings + cosine similarity). Concatenate retrieved chunks into the prompt.

This connects to the Search Engine tutorial -- RAG is search + LLM, with embedding similarity taking the place of (or complementing) keyword matching.

Evidence: A small Q&A demo: load your own PDF or notes; ask questions; model answers with grounded references.
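
The retrieval step can be prototyped before any embedding model is involved. The sketch below uses bag-of-words counts as a toy stand-in for learned embeddings (an assumption made purely for illustration), ranks chunks by cosine similarity, and builds a prompt. The function names and the prompt layout are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate retrieved chunks in front of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "the causal mask hides future tokens",
    "byte pair encoding merges frequent pairs",
    "dropout is disabled at inference time",
]
top = retrieve("what does the causal mask do", chunks, k=1)
prompt = build_prompt("what does the causal mask do", chunks, k=1)
```

Swapping `embed` for a real embedding model changes the quality of the ranking but not the shape of the pipeline.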


8. Tests & evidence

| Test | How |
| --- | --- |
| Tokenizer round-trip | decode(encode(text)) == text for every test string |
| Bigram baseline | Validation loss should be ~2.5 nats per character (this is the bigram model's cross-entropy, not the entropy of English) |
| Attention output shape | (B, T, head_size) -- assert in tests |
| Causal mask | Position t cannot attend to positions > t. Test by changing the inputs at future positions and confirming that outputs at earlier positions are unchanged |
| Validation loss curves | Loss drops at each milestone; plot is monotonic-ish |
| Sample quality | Manual grading. After milestone 5, samples should be recognizably Shakespeare-flavored |
| Compare against nanoGPT | At equal hyperparameters, your model and nanoGPT should match within tolerance |

The strongest evidence: a sample paragraph of generated text alongside a baseline (bigram). The improvement should be obvious.


9. Common pitfalls

  • Forgetting the scaling factor in attention (/ √d). Without it, softmax saturates and gradients vanish.
  • Wrong dimension order in q @ k.transpose(-2, -1). Easy to swap. Result: nonsensical attention patterns.
  • Forgetting the causal mask. Without it, you're training a "see the future" model -- looks great on train, fails on inference.
  • Position embeddings off-by-one. A token at position 0 should see position 0; an off-by-one breaks everything.
  • Mixing batch and time dimensions in cross-entropy. Reshape carefully: (B*T, C) and (B*T,).
  • LayerNorm before or after attention? Pre-norm (x + sa(ln(x))) is the modern default. Post-norm (ln(x + sa(x))) was original; harder to train.
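  • Dropout at inference. Don't. Call model.eval().
  • Forgetting .eval() + torch.no_grad() during sampling. Without no_grad(), autograd keeps activations for a backward pass that never happens, which wastes memory. Without eval(), dropout stays on and adds noise on top of the sampling randomness you do want.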
  • Tokenization mismatch between training and inference. Use the same encoder both times.
  • Training "loss looks great" but samples look bad. Likely cause: data leakage from val into train, or wrong masking.
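
The first pitfall, the missing 1/√d factor, is easy to see numerically. This sketch draws a random query and random keys with unit-variance entries (an assumption about the inputs, chosen for illustration) and compares the largest attention weight with and without scaling.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def max_attention_weight(d, scale, T=8, seed=0):
    """Largest softmax weight for one random query against T random keys of dimension d."""
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(d)]
    keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    if scale:
        scores = [s / math.sqrt(d) for s in scores]
    return max(softmax(scores))

unscaled = max_attention_weight(d=256, scale=False)
scaled = max_attention_weight(d=256, scale=True)
```

Unscaled dot products have variance that grows with d, so the softmax tends to put nearly all of its mass on one key. Dividing by √d keeps the scores near unit variance and the weights more spread out.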

10. Extensions

  • BPE tokenizer. Milestone 6. Mentioned above.
  • Larger context window. Karpathy uses 256. GPT-3 uses 2048. Modern models 128k+.
  • Flash attention. Memory-efficient attention. Tri Dao's algorithm.
  • Rotary positional embeddings (RoPE). Modern replacement for learned position embeddings.
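  • Mixture of Experts (MoE). Activates only a few expert sub-networks per token, so compute per token grows more slowly than parameter count. Widely reported (not officially confirmed) to be used in GPT-4.
  • Fine-tuning. Take a pretrained model and fine-tune on a specific task (instruction-following, code, etc.).
  • RLHF / DPO. Reinforcement learning from human feedback, and direct preference optimization as a simpler alternative. RLHF is the technique that turned the GPT-3 family into ChatGPT.
  • Quantization -- 8-bit or 4-bit. Halves or quarters weight memory relative to 16-bit, usually at minor quality loss.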
  • Distillation -- train a small model to imitate a large one.
  • Vision transformer. Same architecture, replace tokens with image patches.
  • RAG. Milestone 8. Connects to search.
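
As one concrete taste of these extensions, the quantization bullet can be illustrated with the simplest scheme: symmetric absmax quantization to 8-bit integers. This is a sketch of the idea on a plain list of made-up weights, not what any particular library does. Real schemes usually quantize per channel or per block.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.0]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte plus a shared scale, and the rounding error per weight is at most half of `scale`.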

11. Module integration

| Module | What the LLM deepens |
| --- | --- |
| Sem 1 Module 4 -- Linear algebra | Attention is softmax(Q K^T / √d) V. Every line is matmul. Internalizes shape arithmetic. |
| Sem 1 Module 5 -- Statistics / probability | Cross-entropy, sampling, temperature, top-k, nucleus sampling. |
| Sem 2 Module 4 -- Dynamic programming | Backprop through the entire transformer is one DAG. Same machinery as Neural Network tutorial. |
| Neural Network tutorial | Direct prerequisite -- the autograd engine you built is what powers this. |
| 3D Renderer tutorial | The two big Sem 1 math-heavy projects. Linear algebra + probability in different domains. |
| Search Engine tutorial | RAG is search + LLM. The combination is the dominant pattern in modern AI applications. |
| Regex Engine tutorial | Different parsing approaches -- finite automata vs subword tokenization. Both turn text into something processable. |

12. Portfolio framing

What to publish:

  • Source organized as tokenizer/, model/, train/, sample/.
  • A training curves plot: loss over training steps, baseline vs final.
  • A sample paragraph of generated text. Pick the most coherent one from 5-10 samples.
  • A README with:
    • Model size (parameters).
    • Training data (Shakespeare, Hacker News, your blog, etc.).
    • Training hardware and wall time.
    • Sample outputs.
    • Honest assessment of capabilities and limitations.

What to keep private:

  • Training data with private content (your own writing, anything copyrighted).
  • API keys for any inference services you used for comparison.

Reviewer entry points:

  • model/transformer.py -- the architecture.
  • model/attention.py -- the attention mechanism.
  • train/loop.py -- the training loop.
  • README must include: training curves plot, sample paragraph, acknowledgement of Karpathy/Raschka as primary references.

A working GPT-from-scratch is a flagship portfolio piece. "I trained a 10M-parameter transformer that writes Shakespeare-flavored text from a corpus I prepared myself" is concrete, verifiable, and demonstrates depth beyond using a pre-trained API.

Honesty disclaimer

A 10M-parameter model from this tutorial is not ChatGPT. It uses the same family of architecture (a decoder-only transformer), at less than 0.01% of the 175B parameters of GPT-3. The right framing in your portfolio:

"I implemented a complete GPT-style transformer from scratch, trained on a small corpus. The architecture mirrors GPT-2; the scale is much smaller. The point is depth of understanding, not competitive model quality."

This honesty strengthens the portfolio because it shows technical maturity. Overclaiming weakens it.


13. Local source backbone

Use the local LLM chunks as a chapter map for a fuller semester pass:

  • Building LLMs From Scratch (build-your-own/building-llms-from-scratch)
  • 2024 Build LLMs (build-your-own/llms-2024)
  • Build a Large Language Model From Scratch (build-your-own/large-language-model-raschka)

These sources should expand the project into a reproducible lab notebook, not replace Karpathy's minimal build.

| Local chunks | Use them for | Add to this project |
| --- | --- | --- |
| building-llms-from-scratch-contents/002-008 | Big-picture LLM workflow, data setup, and tokenization foundations | Add a tokenizer design note comparing character, word, BPE, and GPT-style tokenization. |
| 009-018 | Embeddings, attention, causal masks, and transformer blocks | Add shape tables for every tensor in the forward pass. |
| 019-023 | Training loop, optimizer, evaluation, and sample generation | Add a reproducibility packet: seed, device, batch size, context length, train/val loss. |
| 024-032 | Fine-tuning, instruction tuning, and practical next steps | Add an extension path: domain continuation pretraining, then instruction tuning on a tiny curated set. |
| 2024-build-llms-contents/001-003 | Architecture summaries and technical slide responses | Use as review prompts after the transformer is working. |
| Raschka chunks | Longer-form implementation detail across tokenizer, GPT model, pretraining, and fine-tuning | Use as the deep reading path for learners who want book-length scaffolding. |

Extra checkpoints from the book chunks

  1. Tokenizer audit: train or implement a small tokenizer and show how it segments code, prose, numbers, and rare words.
  2. Attention audit: print the causal mask and one attention matrix for a tiny batch; explain which tokens can attend to which prior tokens.
  3. Scaling audit: run the same code at three model sizes and report loss, tokens/sec, memory use, and sample quality.
  4. Fine-tuning audit: compare base-model samples and fine-tuned samples on the same prompts, then document failure modes.

14. Deep project spec

Project contract

Build a small GPT-style language model with a reproducible training packet. The minimum contract is tokenizer, dataset split, embeddings, causal self-attention, multi-head attention, feed-forward block, residual/LayerNorm stack, training loop, sampling, evaluation loss, and an honesty note about scale. RAG and fine-tuning are extensions.

Source-backed reading map

| Source ID | Use for | Required output |
| --- | --- | --- |
| build-your-own/building-llms-from-scratch | tokenizer, embeddings, attention, transformer block, training, fine-tuning | tokenizer audit, tensor-shape tables, training packet |
| build-your-own/llms-2024 | architecture review and implementation prompts | review questions and design recap |
| build-your-own/large-language-model-raschka | book-length GPT implementation detail | deep reading path and optional fine-tuning checkpoints |

Milestone map

| Milestone | Deliverable | Tests | Failure case |
| --- | --- | --- | --- |
| Dataset/tokenizer | train/val split and tokenizer | encode/decode round trip | unknown/rare token behavior |
| Bigram baseline | simplest model and loss | baseline loss fixture | leakage between train/val |
| Attention | causal mask and attention weights | shape/mask tests | token attends to future |
| Transformer block | MHA, MLP, residual, LayerNorm | tensor-shape snapshots | unstable loss or NaNs |
| Training loop | optimizer, checkpoints, metrics | fixed-seed run | non-reproducible run |
| Sampling | temperature/top-k if included | prompt-output transcript | degenerate repetition |
| Scaling/fine-tuning extension | three model sizes or tiny instruction set | comparison report | overclaiming model ability |
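
For the sampling milestone, top-k filtering is a small addition on top of softmax sampling: keep the k highest logits and set the rest to negative infinity before normalizing. The sketch below works on a plain list of made-up logits. Ties at the cutoff are kept, so it can retain slightly more than k entries.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) is 0.0, so masked entries vanish
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits; mask everything below the k-th largest with -inf."""
    if k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

logits = [1.0, 3.0, 2.0, 0.5]  # made-up logits
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered)
```

Tokens outside the top k end up with probability exactly zero, which trims the long tail that tends to derail samples from small models.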

Test matrix

| Test type | Required examples |
| --- | --- |
| Unit | tokenizer round trip, mask shape, logits shape |
| Numerical | attention probabilities sum correctly; no future-token leakage |
| Golden | tiny-batch forward pass shape table |
| Experiment | fixed config, seed, loss curve, samples at checkpoints |
| Benchmark | tokens/sec and memory use for at least two model sizes |
| Evaluation | base vs fine-tuned or baseline vs transformer comparison |

Design notes required

  • tokenizer.md: tokenizer choice, examples, compression/coverage tradeoffs.
  • architecture.md: tensor shapes for every major operation.
  • training.md: corpus, split, seed, batch size, context length, optimizer, hardware.
  • limitations.md: scale, hallucination, data quality, and why this is not a production assistant.
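
One lightweight way to keep training.md honest is to generate its numbers from the same config object the training script uses. The sketch below is one possible shape for that. The class name and the values for seed, batch size, and learning rate are illustrative assumptions; only block_size, n_embd, n_head, n_layer, dropout, and the step count come from Milestone 5.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainConfig:
    # From Milestone 5.
    block_size: int = 256
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    dropout: float = 0.2
    max_steps: int = 5000
    # Assumed values for illustration; record whatever you actually used.
    seed: int = 1337
    device: str = "cuda"
    batch_size: int = 64
    learning_rate: float = 3e-4

cfg = TrainConfig()
packet = json.dumps(asdict(cfg), indent=2, sort_keys=True)  # save next to checkpoints
restored = TrainConfig(**json.loads(packet))
```

Writing `packet` to disk alongside each checkpoint means every loss curve and sample in the portfolio can be traced back to an exact configuration.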

Portfolio evidence

Publish the training config, loss curve, sample generations at multiple checkpoints, attention/mask visualization for a tiny batch, tokenizer audit, and explicit scale/limitation disclaimer.


Source

This tutorial draws from the BYO-X catalog "AI Model" section ("A Large Language Model"). Andrej Karpathy's "Let's build GPT" lecture, "Neural Networks: Zero to Hero" series, and Sebastian Raschka's Build a Large Language Model (From Scratch) are the canonical primary references.