Build Your Own LLM (GPT from Scratch)

"By the end of this lecture, you will have a 10-million-parameter language model that generates Shakespeare-like text. It will be small but real." -- Andrej Karpathy, "Let's build GPT"

Building a small GPT-style language model from scratch is the cleanest path from "I understand backprop" to "I understand modern AI." You build the tokenizer, the attention mechanism, the transformer block, and the training loop. The result trains on a single GPU (or even a fast CPU) and writes recognizable English, all in roughly 300 lines of Python.

This is the natural continuation of the Neural Network tutorial -- same audience, same tools, same author for the canonical primary path (Karpathy's "Zero to Hero" series).

1. Overview & motivation

A transformer-based language model has these components:

text  -> [tokenizer]   -> token ids
ids   -> [embed]       -> vectors per token
vec   -> [transformer blocks x N]
                       -> ([self-attention] -> [feed-forward]) x N
out   -> [unembedding] -> logits over vocabulary
logits -> [softmax]    -> probability distribution
        -> [sample]    -> next token
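The last two stages of the pipeline above (softmax, then sample) can be sketched in plain Python. This is a minimal illustration rather than the tutorial's PyTorch code: the names `softmax` and `sample_next`, the `temperature` parameter, and the three-element `logits` list are assumptions made for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, rng=random):
    """Draw one token id from the categorical distribution defined by the logits."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)                   # temperature 1.0
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
token = sample_next(logits, rng=random.Random(0))
```

Lowering the temperature moves probability toward the highest-scoring token; raising it flattens the distribution and makes samples more varied.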

You build every piece. By the end:

You can train a 10M-100M parameter model that writes coherent text in your domain (Shakespeare, code, recipes, anything).
You understand what every line of a real transformer does.
You can read modern ML papers and recognize the constructions in code.

What you can only learn by building one:

Why attention is a soft weighted lookup -- and why that's such a powerful primitive.
Why causal masking is what makes a language model "language" (predict next, not all).
Why positional embeddings exist (self-attention on its own ignores token order, so order has to be injected).
Why layer normalization stabilizes deep transformers (and why many recent models have moved from LayerNorm to RMSNorm).
Why training a 100M-parameter model is not hard -- but training a 100B-parameter model is.

2. Where this fits in the degree

Phase: Foundations
Semester: 1 (Math Foundations) + Sem 2 (Algorithms)
Modules deepened:
- Sem 1 Module 4 (linear algebra) -- attention is softmax(Q K^T / √d) V. Every line is matrix multiplication.
- Sem 1 Module 5 (probability / statistics) -- cross-entropy loss, sampling from a categorical distribution, temperature.
- Sem 2 Module 4 (DP) -- backprop through the entire transformer is one big DAG. Same backward-pass machinery as in the Neural Network tutorial.

Cross-phase relevance:

Direct extension of the Neural Network tutorial. Use the autograd engine you built there.
Connects to modern AI engineering, search relevance, code generation.
The tokenizer connects to the Regex Engine tutorial (different parsing approach).

3. Prerequisites

Complete the Neural Network tutorial first. This tutorial assumes you have a working autograd engine and can train a small MLP.
Linear algebra: matrix multiplication, transpose, softmax. (Sem 1 Module 4.)
Probability: cross-entropy, sampling. (Sem 1 Module 5.)
Python: comfortable with NumPy or PyTorch tensors.

You do not need any prior NLP background. Karpathy and Raschka both build everything from scratch.

4. Theory & research

Required reading

Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out" (YouTube video + nanoGPT repo + build-nanogpt repo) -- the canonical tutorial. ~2 hours of video. Walks line-by-line through a working transformer. Start here.
Sebastian Raschka, Build a Large Language Model (From Scratch) -- Manning book + free companion GitHub + free 48-part YouTube live-coding series. Seven chapters: tokenization, attention, transformer, pretraining, fine-tuning. Deepest single resource.

Strongly recommended

Vaswani et al., "Attention Is All You Need" (2017) -- the original transformer paper. arxiv:1706.03762. Read once after Karpathy. Short.
Jay Alammar, "The Illustrated Transformer" (jalammar.github.io/illustrated-transformer/) -- the canonical visual explanation. Read alongside Karpathy.
Karpathy's full "Neural Networks: Zero to Hero" series -- karpathy.ai/zero-to-hero.html. The full progression: micrograd -> makemore (bigrams) -> MLP -> backprop -> batch norm -> WaveNet -> GPT -> tokenizer.

Bonus depth

Andrej Karpathy, "Let's build the GPT Tokenizer" (YouTube) -- companion video on byte-pair encoding (BPE). The tokenizer is the unsung hero of modern LLMs.
Sebastian Raschka, "Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch" (sebastianraschka.com/blog/2025/bpe-from-scratch.html) -- focused on the tokenizer.
Aman Arora's "The Annotated GPT-2" -- a clean small implementation to read alongside.

Theory (for deeper understanding)

Goodfellow, Bengio, Courville, Deep Learning, Chapter 10 (sequence modeling). Free online.
Stanford CS224N -- Natural Language Processing with Deep Learning -- free course recordings on YouTube.

5. Curated tutorial list (from BYO-X)

The BYO-X "AI Model" category lists:

Python: A Large Language Model (LLM) -- primary entry; see Karpathy and Raschka resources above
Python: Diffusion Models for Image Generation -- see related Hugging Face Diffusion Course
Python: RAG for Document Search -- see resources below for extensions

Additional canonical references

karpathy/nanoGPT (github.com/karpathy/nanoGPT) -- the production-ready version of what Karpathy builds in the video. ~600 lines.
karpathy/build-nanogpt (github.com/karpathy/build-nanogpt) -- step-by-step git commits that accompany Karpathy's follow-up video "Let's reproduce GPT-2 (124M)". Useful once the small model works.
rasbt/LLMs-from-scratch (github.com/rasbt/LLMs-from-scratch) -- Raschka's complete code from his book.
Hugging Face NLP Course (huggingface.co/learn/nlp-course) -- free, comprehensive. Goes deeper than this project but includes a transformer-from-scratch chapter.

RAG-specific (for the extension milestone)

6. Recommended primary path

Andrej Karpathy, "Let's build GPT: from scratch, in code, spelled out".

Two hours of video. Karpathy starts from a tiny Shakespeare dataset and a bigram model, then layers in:

The bigram baseline.
A simple averaging "context-window" model.
Self-attention.
Multi-head attention.
Feed-forward layers.
Layer normalization.
Scaling up: dropout, residual connections, larger context.

By the end you have a working ~10M-parameter Shakespeare model. Roughly 300 lines of Python + PyTorch.

For a more guided book-format experience: Sebastian Raschka's Build a Large Language Model (From Scratch). Same destination, more explanation, comes with 48 free YouTube videos.

For this degree: Karpathy first (2 days), Raschka if you want depth (2-4 weeks).

If you've never trained any neural network: do the Neural Network tutorial first. This tutorial assumes that foundation.

7. Implementation milestones

Milestone 1: Character-level tokenizer + bigram model

Read a corpus (Karpathy uses tinyshakespeare.txt). Build a character-level vocabulary.

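import torch

# Load the corpus as one long string.
with open("tinyshakespeare.txt", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))                        # every distinct character
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}     # character -> id
itos = {i: ch for i, ch in enumerate(chars)}     # id -> character
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Encode the whole corpus and hold out the last 10% for validation.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]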

Build a bigram model: a vocab_size x vocab_size embedding table where row i holds the logits for the next character given current character i.

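import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Row i of the table is the logit vector for "what follows token i".
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)   # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        # cross_entropy expects (N, C) logits and (N,) targets, so flatten B and T.
        loss = F.cross_entropy(logits.view(B*T, C), targets.view(B*T))
        return logits, loss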

Evidence: Sample from the trained bigram. Output should be character-level "noise that looks vaguely like English."
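Sampling from the bigram means repeatedly feeding the sequence back in and drawing the next token from the predicted distribution. The sketch below assumes a model whose forward pass returns `(logits, loss)` as above; `TinyBigram`, the `generate` function, and its `block_size` argument are hypothetical names for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBigram(nn.Module):
    """Hypothetical stand-in with the same (logits, loss) return convention."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        return self.table(idx), None

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=None):
    """Extend idx of shape (B, T) by max_new_tokens sampled tokens."""
    for _ in range(max_new_tokens):
        # Crop to the context length if the model has one (assumed optional here).
        idx_cond = idx if block_size is None else idx[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]                           # last time step: (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

torch.manual_seed(0)
model = TinyBigram(vocab_size=5)
start = torch.zeros((1, 1), dtype=torch.long)   # begin from token id 0
out = generate(model, start, max_new_tokens=10)
```

With the real tokenizer, `decode(out[0].tolist())` would turn the sampled ids back into text.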

Milestone 2: Self-attention

The mathematical heart of the transformer.

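class Head(nn.Module):
    """One head of causal self-attention.

    Assumes n_embd and block_size are module-level hyperparameters.
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular matrix: position t may look at positions <= t only.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B, T, head_size)
        q = self.query(x)    # (B, T, head_size)
        # Scale by sqrt(head_size), the dimension of q and k (not by C = n_embd).
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)    # (B, T, head_size)
        out = wei @ v        # (B, T, head_size)
        return out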

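The intuition: each position computes a learned query vector, and every position computes a key and a value vector. A position takes the dot product of its query with the keys of itself and all earlier positions (the causal mask hides later ones), applies softmax to get weights that sum to 1, and uses those weights to form a weighted sum of the value vectors.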
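To make the lookup concrete, here is a minimal sketch on a single tiny sequence with random tensors and no learned projections. The sizes `T = 4` and `head_size = 8` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 8                      # assumed toy sizes
q = torch.randn(T, head_size)            # one query per position
k = torch.randn(T, head_size)            # one key per position
v = torch.randn(T, head_size)            # one value per position

scores = q @ k.T / head_size ** 0.5      # (T, T) scaled dot products
mask = torch.tril(torch.ones(T, T))      # 1 where attention is allowed
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)      # each row sums to 1
out = weights @ v                        # weighted sum of values, (T, head_size)

print(mask)
print(weights)
```

Row t of `weights` is zero to the right of the diagonal, and row 0 can only attend to itself, so `out[0]` equals `v[0]`.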

Evidence: Re-train with attention. Validation loss drops. Sampled text starts looking more coherent.

Milestone 3: Multi-head attention + feed-forward + position embeddings

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)  # concatenated heads -> n_embd
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
    def forward(self, x):
        return self.net(x)

Position embeddings: add a learned vector per position, so attention sees order.
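A minimal sketch of learned position embeddings, assuming toy sizes and the hypothetical names `tok_emb_table` and `pos_emb_table`: the position vector is looked up by index and added to the token vector, with broadcasting over the batch dimension.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 8, 32          # assumed sizes for illustration
tok_emb_table = nn.Embedding(vocab_size, n_embd)    # one vector per token id
pos_emb_table = nn.Embedding(block_size, n_embd)    # one vector per position

idx = torch.randint(vocab_size, (4, block_size))    # (B, T) batch of token ids
B, T = idx.shape
tok_emb = tok_emb_table(idx)                        # (B, T, n_embd)
pos_emb = pos_emb_table(torch.arange(T))            # (T, n_embd)
x = tok_emb + pos_emb                               # broadcast over B: (B, T, n_embd)
```

The same token at two different positions now yields two different input vectors, which is what lets attention distinguish order.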

Evidence: Validation loss drops further. Samples look like bad Shakespeare instead of random characters.

Milestone 4: Transformer block (attention + MLP + residuals + LayerNorm)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual
        x = x + self.ffwd(self.ln2(x))  # residual
        return x

The residual connections (x + ...) are the single most important architectural feature for training deep networks.

Evidence: Stack 6 blocks. Validation loss drops to ~1.5. Samples look like attempted Shakespeare.

Milestone 5: Train at scale (10M parameters)

Increase: block_size=256, n_embd=384, n_head=6, n_layer=6. Add dropout (p=0.2).
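It is worth checking where the "10M parameters" figure comes from. The sketch below counts parameters for the layers defined in Milestones 2 to 4 under some assumptions that are not stated in this section: a character vocabulary of 65 symbols, one final LayerNorm before the output layer, and an output layer with bias. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count parameters for a GPT built from the Head/MultiHeadAttention/FeedForward/Block classes."""
    head_size = n_embd // n_head
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # Attention: key/query/value have no bias; the output projection has one.
    attn = n_head * 3 * n_embd * head_size + (n_embd * n_embd + n_embd)
    # Feed-forward: n_embd -> 4*n_embd -> n_embd, both layers with bias.
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)         # ln1 and ln2, weight + bias each
    block = attn + ffwd + layer_norms
    final_ln = 2 * n_embd                  # assumption: one LayerNorm after the last block
    lm_head = n_embd * vocab_size + vocab_size   # assumption: linear output layer with bias
    return tok_emb + pos_emb + n_layer * block + final_ln + lm_head

total = count_params(vocab_size=65, block_size=256, n_embd=384, n_head=6, n_layer=6)
print(f"{total:,} parameters")  # 10,788,929
```

Almost all of the total sits in the six blocks (about 1.77M each); the embeddings and output layer are small because the vocabulary is tiny.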

Train for 5,000 steps on a GPU. Karpathy reports about 15 minutes on an A100; expect noticeably longer on a free Google Colab GPU.
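The training loop itself is short. Here is a minimal sketch, assuming a toy repeating corpus and a bigram table standing in for the transformer so that it runs in seconds on a CPU. The helper `get_batch`, the toy sizes, and the learning rate are assumptions for illustration, not the lecture's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 16, 8, 32   # assumed toy sizes
# Hypothetical toy corpus: a repeating pattern, so the next token is predictable.
data = torch.arange(2000) % vocab_size

def get_batch():
    """Sample random windows; targets are the inputs shifted by one position."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)     # bigram stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-1)

losses = []
for step in range(200):
    xb, yb = get_batch()
    logits = model(xb)                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

For the real model, swap in the transformer and `train_data`, lower the learning rate, and evaluate on `val_data` every few hundred steps with `model.eval()` and `torch.no_grad()`.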

Evidence: Generated text:

DUKE VINCENTIO:
Why, sir, by some good prince in this seal'd
Wrong'd shoulder gave him not, his other...

Recognizably Shakespeare-flavored. Not coherent, but the style is there.

Milestone 6 (Karpathy lecture 7): BPE tokenizer

Replace the character-level tokenizer with byte-pair encoding (BPE). This is what GPT-2/3/4 use.

Karpathy has a dedicated lecture: "Let's build the GPT Tokenizer". Sebastian Raschka has a companion blog post.

The basic algorithm:

def get_stats(tokens):
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == pair[0] and tokens[i+1] == pair[1]:
            new_tokens.append(new_id)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens

Evidence: Compress 1MB of text by ~3x while keeping decodability.
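The two helpers above become a tokenizer once they are run in a loop: count pairs, merge the most frequent one, repeat. A minimal byte-level sketch follows; `train_bpe` and `decode` are hypothetical names, the helpers are repeated so the block runs on its own, and the sample text is made up.

```python
def get_stats(tokens):
    counts = {}
    for pair in zip(tokens, tokens[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(tokens, pair, new_id):
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes (ids 0-255) and learn num_merges merge rules."""
    tokens = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        stats = get_stats(tokens)
        if not stats:
            break                               # fewer than two tokens left
        pair = max(stats, key=stats.get)        # most frequent adjacent pair
        new_id = 256 + step
        tokens = merge(tokens, pair, new_id)
        merges[pair] = new_id
    return tokens, merges

def decode(tokens, merges):
    """Rebuild the byte string for each id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():       # dicts keep insertion (merge) order
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

text = "low lower lowest low low"
tokens, merges = train_bpe(text, num_merges=5)
print(len(text.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

The compression ratio for the evidence check is simply the byte count divided by the token count after training on your corpus.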

Milestone 7 (optional, ambitious): Pretrain a real model

With nanoGPT you can train a model on OpenWebText at GPT-2 scale (~124M parameters). Result: a model that writes coherent English (not just Shakespeare style).

This requires real GPU time, not a few hours on one card. The nanoGPT README reports about 4 days on a single node of 8 A100 40GB GPUs to reproduce GPT-2 (124M), so budget for multi-GPU rental. Lambda Labs / RunPod / Vast.ai rent A100s for ~$1/hour per GPU; check current prices.

Milestone 8 (optional, extension): RAG

Add a retrieval step: before generating, look up relevant context from a document store (embeddings + cosine similarity). Concatenate retrieved chunks into the prompt.

This connects to the Search Engine tutorial -- RAG is retrieval + LLM. The retriever can be embedding similarity (as here), keyword search, or a hybrid of the two.

Evidence: A small Q&A demo: load your own PDF or notes; ask questions; model answers with grounded references.
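The retrieval step can be sketched without any model at all. The example below uses a bag-of-words count vector as a toy stand-in for a real embedding model; `embed`, `cosine`, `retrieve`, `build_prompt`, and the three chunks are all hypothetical and exist only to show the ranking and prompt assembly.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks, k=2):
    """Concatenate the retrieved chunks in front of the question."""
    context = "\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

chunks = [
    "Attention computes a weighted sum of value vectors.",
    "Byte-pair encoding merges the most frequent adjacent pair.",
    "Residual connections add the input of a block to its output.",
]
query = "how does attention weight value vectors"
top = retrieve(query, chunks, k=1)
prompt = build_prompt(query, chunks, k=1)
```

In a real build, `embed` would call a sentence-embedding model and the chunks would come from your PDF or notes; the ranking and prompt assembly stay the same.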

8. Tests & evidence

Test	How
Tokenizer round-trip	`decode(encode(text)) == text` for every test string
Bigram baseline	Validation loss should be `~2.5` nats per character, down from `ln(65) ≈ 4.17` for a uniform guess over tinyshakespeare's 65 characters
Attention output shape	`(B, T, head_size)` -- assert in tests
Causal mask	Position `t` cannot attend to positions `> t`. Test by changing the inputs at positions `> t` and confirming the output at positions `<= t` is unchanged
Validation loss curves	Loss drops at each milestone; plot is monotonic-ish
Sample quality	Manual grading. After milestone 5, samples should be recognizably Shakespeare-flavored
Compare against nanoGPT	At equal hyperparameters, your model and nanoGPT should match within tolerance
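The causal-mask row in the table above translates into a short test. This is a sketch under stated assumptions: `causal_attention` is a hypothetical single-head function with explicit weight matrices instead of the `Head` module, and the sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention on x of shape (T, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    T = x.shape[0]
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
T, C, hs, t = 6, 16, 8, 2                       # assumed toy sizes; t is the cut-off position
wq, wk, wv = (torch.randn(C, hs) for _ in range(3))
x = torch.randn(T, C)
x_perturbed = x.clone()
x_perturbed[t + 1:] = torch.randn(T - t - 1, C)  # change only the future positions

out_a = causal_attention(x, wq, wk, wv)
out_b = causal_attention(x_perturbed, wq, wk, wv)

# Outputs up to and including t must not depend on anything after t.
causal_ok = torch.allclose(out_a[: t + 1], out_b[: t + 1], atol=1e-6)
# Outputs after t are allowed to change, and here they do.
future_changed = not torch.allclose(out_a[t + 1:], out_b[t + 1:], atol=1e-6)
```

If the mask is removed or transposed, `causal_ok` becomes False, which is the failure this test is meant to catch.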

The strongest evidence: a sample paragraph of generated text alongside a baseline (bigram). The improvement should be obvious.

9. Common pitfalls

Forgetting the scaling factor in attention (/ √d). Without it, softmax saturates and gradients vanish.
Wrong dimension order in q @ k.transpose(-2, -1). Easy to swap. Result: nonsensical attention patterns.
Forgetting the causal mask. Without it, you're training a "see the future" model -- looks great on train, fails on inference.
Position embeddings off-by-one. A token at position 0 should see position 0; an off-by-one breaks everything.
Mixing batch and time dimensions in cross-entropy. Reshape carefully: (B*T, C) and (B*T,).
LayerNorm before or after attention? Pre-norm (x + sa(ln(x))) is the modern default. Post-norm (ln(x + sa(x))) was original; harder to train.
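Dropout at inference. Don't. Call model.eval() before sampling or evaluating.
Forgetting torch.no_grad() during sampling. Autograd keeps activations for a backward pass that never happens, which wastes memory. Without .eval(), dropout stays active and adds noise on top of the sampling randomness you do want.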
Tokenization mismatch between training and inference. Use the same encoder both times.
Training "loss looks great" but samples look bad. Likely cause: data leakage from val into train, or wrong masking.
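The first pitfall in the list above (the missing scaling factor) is easy to see numerically. The sketch below uses plain Python with random unit-variance vectors; the head size `d = 64` and the number of keys `T = 8` are assumed values.

```python
import math
import random

def softmax(scores):
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
d = 64   # head size (assumed value)
T = 8    # number of keys (assumed value)

query = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]

# Dot products of d-dimensional unit-variance vectors have variance about d.
raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d) for s in raw]     # variance back to about 1

peak_raw = max(softmax(raw))
peak_scaled = max(softmax(scaled))
print(f"largest weight without scaling: {peak_raw:.3f}")
print(f"largest weight with scaling:    {peak_scaled:.3f}")
```

Without scaling, one key takes nearly all of the weight, so the softmax sits in its flat region and passes back very small gradients. Dividing by √d keeps the weights spread out at initialization.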

10. Extensions

BPE tokenizer. Milestone 6. Mentioned above.
Larger context window. Karpathy uses 256. GPT-3 uses 2048. Modern models 128k+.
Flash attention. Memory-efficient attention. Tri Dao's algorithm.
Rotary positional embeddings (RoPE). Modern replacement for learned position embeddings.
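Mixture of Experts (MoE). Activates only a few expert sub-networks per token, so parameter count can grow faster than compute per token. Used in open models such as Mixtral; widely reported, though not officially confirmed, for GPT-4.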
Fine-tuning. Take a pretrained model and fine-tune on a specific task (instruction-following, code, etc.).
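RLHF / DPO. Reinforcement learning from human feedback, and direct preference optimization as a simpler alternative. RLHF is the technique that turned GPT-3-class models into ChatGPT.
Quantization -- 8-bit or 4-bit. Halves or quarters memory relative to 16-bit weights, usually at minor quality loss.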
Distillation -- train a small model to imitate a large one.
Vision transformer. Same architecture, replace tokens with image patches.
RAG. Milestone 8. Connects to search.

11. Module integration

Module	What the LLM deepens
Sem 1 Module 4 -- Linear algebra	Attention is `softmax(Q K^T / √d) V`. Every line is matmul. Internalizes shape arithmetic.
Sem 1 Module 5 -- Statistics / probability	Cross-entropy, sampling, temperature, top-k, nucleus sampling.
Sem 2 Module 4 -- Dynamic programming	Backprop through the entire transformer is one DAG. Same machinery as Neural Network tutorial.
Neural Network tutorial	Direct prerequisite -- the autograd engine you built is what powers this.
3D Renderer tutorial	Together with this project, one of the two big Sem 1 math-heavy projects. Linear algebra + probability in different domains.
Search Engine tutorial	RAG is search + LLM. The combination is the dominant pattern in modern AI applications.
Regex Engine tutorial	Different parsing approaches -- finite automata vs subword tokenization. Both turn text into something processable.

12. Portfolio framing

What to publish:

Source organized as tokenizer/, model/, train/, sample/.
A training curves plot: loss over training steps, baseline vs final.
A sample paragraph of generated text. Pick the most coherent one from 5-10 samples.
A README with:
- Model size (parameters).
- Training data (Shakespeare, Hacker News, your blog, etc.).
- Training hardware and wall time.
- Sample outputs.
- Honest assessment of capabilities and limitations.

What to keep private:

Training data with private content (your own writing, anything copyrighted).
API keys for any inference services you used for comparison.

Reviewer entry points:

model/transformer.py -- the architecture.
model/attention.py -- the attention mechanism.
train/loop.py -- the training loop.
README must include: training curves plot, sample paragraph, acknowledgement of Karpathy/Raschka as primary references.

A working GPT-from-scratch is a flagship portfolio piece. "I trained a 10M-parameter transformer that writes Shakespeare-flavored text from a corpus I prepared myself" is concrete, verifiable, and demonstrates depth beyond using a pre-trained API.

Honesty disclaimer

A 10M-parameter model from this tutorial is not ChatGPT. It belongs to the same decoder-only transformer family, at well under 0.01% of GPT-3's 175B parameters. The right framing in your portfolio:

"I implemented a complete GPT-style transformer from scratch, trained on a small corpus. The architecture mirrors GPT-2; the scale is much smaller. The point is depth of understanding, not competitive model quality."

This honesty strengthens the portfolio because it shows technical maturity. Overclaiming weakens it.

13. Local source backbone

Use the local LLM chunks as a chapter map for a fuller semester pass:

Building LLMs From Scratch (build-your-own/building-llms-from-scratch)
2024 Build LLMs (build-your-own/llms-2024)
Build a Large Language Model From Scratch (build-your-own/large-language-model-raschka)

These sources should expand the project into a reproducible lab notebook, not replace Karpathy's minimal build.

Local chunks	Use them for	Add to this project
`building-llms-from-scratch-contents/002`-`008`	Big-picture LLM workflow, data setup, and tokenization foundations	Add a tokenizer design note comparing character, word, BPE, and GPT-style tokenization.
`009`-`018`	Embeddings, attention, causal masks, and transformer blocks	Add shape tables for every tensor in the forward pass.
`019`-`023`	Training loop, optimizer, evaluation, and sample generation	Add a reproducibility packet: seed, device, batch size, context length, train/val loss.
`024`-`032`	Fine-tuning, instruction tuning, and practical next steps	Add an extension path: domain continuation pretraining, then instruction tuning on a tiny curated set.
`2024-build-llms-contents/001`-`003`	Architecture summaries and technical slide responses	Use as review prompts after the transformer is working.
Raschka chunks	Longer-form implementation detail across tokenizer, GPT model, pretraining, and fine-tuning	Use as the deep reading path for learners who want book-length scaffolding.

Extra checkpoints from the book chunks

Tokenizer audit: train or implement a small tokenizer and show how it segments code, prose, numbers, and rare words.
Attention audit: print the causal mask and one attention matrix for a tiny batch; explain which tokens can attend to which prior tokens.
Scaling audit: run the same code at three model sizes and report loss, tokens/sec, memory use, and sample quality.
Fine-tuning audit: compare base-model samples and fine-tuned samples on the same prompts, then document failure modes.

14. Deep project spec

Project contract

Build a small GPT-style language model with a reproducible training packet. The minimum contract is tokenizer, dataset split, embeddings, causal self-attention, multi-head attention, feed-forward block, residual/LayerNorm stack, training loop, sampling, evaluation loss, and an honesty note about scale. RAG and fine-tuning are extensions.

Source-backed reading map

Source ID	Use for	Required output
`build-your-own/building-llms-from-scratch`	tokenizer, embeddings, attention, transformer block, training, fine-tuning	tokenizer audit, tensor-shape tables, training packet
`build-your-own/llms-2024`	architecture review and implementation prompts	review questions and design recap
`build-your-own/large-language-model-raschka`	book-length GPT implementation detail	deep reading path and optional fine-tuning checkpoints

Milestone map

Milestone	Deliverable	Tests	Failure case
Dataset/tokenizer	train/val split and tokenizer	encode/decode round trip	unknown/rare token behavior
Bigram baseline	simplest model and loss	baseline loss fixture	leakage between train/val
Attention	causal mask and attention weights	shape/mask tests	token attends to future
Transformer block	MHA, MLP, residual, LayerNorm	tensor-shape snapshots	unstable loss or NaNs
Training loop	optimizer, checkpoints, metrics	fixed-seed run	non-reproducible run
Sampling	temperature/top-k if included	prompt-output transcript	degenerate repetition
Scaling/fine-tuning extension	three model sizes or tiny instruction set	comparison report	overclaiming model ability

Test matrix

Test type	Required examples
Unit	tokenizer round trip, mask shape, logits shape
Numerical	attention probabilities sum correctly; no future-token leakage
Golden	tiny-batch forward pass shape table
Experiment	fixed config, seed, loss curve, samples at checkpoints
Benchmark	tokens/sec and memory use for at least two model sizes
Evaluation	base vs fine-tuned or baseline vs transformer comparison
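As one instance of the unit tests in the matrix above, here is a sketch of the tokenizer round-trip test together with the "unknown token" failure case from the milestone map. It is written as plain functions that a runner such as pytest could collect; the sample sentence and the test names are assumptions for illustration.

```python
text = "First Citizen: Before we proceed any further, hear me speak."
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

def test_round_trip():
    # Every string built from in-vocabulary characters must survive encode/decode.
    for s in ["speak", "we proceed", "", text]:
        assert decode(encode(s)) == s

def test_unknown_character_is_rejected():
    # 'Z' never occurs in the sample text, so a character-level tokenizer cannot encode it.
    try:
        encode("Zebra")
    except KeyError:
        return
    raise AssertionError("expected KeyError for an out-of-vocabulary character")

test_round_trip()
test_unknown_character_is_rejected()
```

A character-level tokenizer fails hard on unseen characters. Byte-level BPE avoids this because every string decomposes into the 256 byte values.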

Design notes required

tokenizer.md: tokenizer choice, examples, compression/coverage tradeoffs.
architecture.md: tensor shapes for every major operation.
training.md: corpus, split, seed, batch size, context length, optimizer, hardware.
limitations.md: scale, hallucination, data quality, and why this is not a production assistant.
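One way to keep training.md honest is to generate it from the same config object the training script uses. The sketch below is an assumption about how that could look, not a prescribed layout: `TrainConfig` is a hypothetical name, the architecture values come from Milestone 5, and the seed, batch size, learning rate, and device are placeholder assumptions to replace with your own.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainConfig:
    corpus: str = "tinyshakespeare.txt"
    train_fraction: float = 0.9    # 90/10 split from Milestone 1
    block_size: int = 256          # context length, Milestone 5
    n_embd: int = 384
    n_head: int = 6
    n_layer: int = 6
    dropout: float = 0.2
    max_steps: int = 5000
    seed: int = 1337               # assumption: any fixed value works
    batch_size: int = 64           # assumption: not specified in this tutorial
    learning_rate: float = 3e-4    # assumption: not specified in this tutorial
    optimizer: str = "AdamW"       # assumption
    device: str = "cuda"           # assumption: record what you actually used

config = TrainConfig()
packet = json.dumps(asdict(config), indent=2, sort_keys=True)
print(packet)
```

Writing `packet` next to each checkpoint makes a fixed-seed run reproducible by someone who only has the repository.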

Portfolio evidence

Publish the training config, loss curve, sample generations at multiple checkpoints, attention/mask visualization for a tiny batch, tokenizer audit, and explicit scale/limitation disclaimer.

Source

This tutorial draws from the BYO-X catalog "AI Model" section ("A Large Language Model"). Andrej Karpathy's "Let's build GPT" lecture, "Neural Networks: Zero to Hero" series, and Sebastian Raschka's Build a Large Language Model (From Scratch) are the canonical primary references.
