Build Your Own Neural Network from Scratch
"I would never trust an idea I couldn't write the gradient for." -- Andrej Karpathy (paraphrased)
Implementing a neural network from scratch -- no PyTorch, no autograd library -- is the single best way to internalize what "backpropagation" actually computes. It turns ML from a black box into a few hundred lines of Python you understand line by line.
1. Overview & motivation
A neural network is a parameterized function f(x; θ) that you train by gradient descent on a loss L(f(x; θ), y). Every modern ML framework is, under the hood, a system that:
- Lets you compose differentiable operations into a computational graph.
- Runs forward to compute the output.
- Runs backward to compute `∂L/∂θ` for every parameter.
- Updates `θ ← θ − α ∂L/∂θ`.
What you can only learn by building one:
- Why automatic differentiation is easy in principle (chain rule applied locally).
- Why batch matrix multiply is the dominant cost of neural-network training.
- Why activations like ReLU exist (gradient flow), and why deep nets needed innovations like residual connections.
- Why the same gradient code that works for one neuron also works for GPT-class models -- autograd scales linearly with the number of operations.
2. Where this fits in the degree
- Phase: Foundations
- Semester: 1 (Linear algebra, statistics) -> 2 (Algorithms, DP)
- Modules deepened: Sem 1 Module 4 (linear algebra foundations), Sem 1 Module 5 (statistics & inference), Sem 2 Module 4 (DP -- backprop is literally dynamic programming over a computational graph)
Cross-phase relevance:
- Useful background for any modern data engineering, search ranking, or recommendation work
- Connects directly to the "AI Model" category in BYO-X (Karpathy's "Build GPT from Scratch")
3. Prerequisites
- Linear algebra: matrix multiplication, transpose, partial derivatives. Comfortable with `dz/dx` notation.
- Python: NumPy. You should be able to write a matrix multiply with `@`.
- Calculus: chain rule. That's it. No measure theory required.
If linear algebra feels shaky, finish Sem 1 Module 4 first.
4. Theory & research
Required reading
- Andrej Karpathy, "Neural Networks: Zero to Hero" (karpathy.ai/zero-to-hero.html) -- video series. The single best resource for this project. Start with the micrograd video, then makemore.
- Michael Nielsen, Neural Networks and Deep Learning (free online) -- Chapters 1-3 cover backprop and gradient descent at the right depth.
Strongly recommended
- Goodfellow, Bengio, Courville, Deep Learning -- Chapter 6 (feedforward networks), Chapter 8 (optimization). The free PDF is at deeplearningbook.org.
- 3Blue1Brown's neural network series (youtube.com/@3blue1brown) -- for visual intuition.
For the autograd part specifically
- Karpathy's micrograd (github.com/karpathy/micrograd) -- 150 lines. Read every line. This is your scaffold.
Historical / foundational papers
- Rumelhart, Hinton, Williams (1986), "Learning representations by back-propagating errors" -- Nature paper that brought backprop into modern ML.
- LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition" -- the LeNet-5 paper that established CNNs for document and handwritten-digit recognition.
5. Curated tutorial list (from BYO-X)
- C#: Neural Network OCR -- Andrew Kirillov
- F#: Building Neural Networks in F#: Part 1 and Part 2 -- Mathias Brandewinder
- Go: Build a multilayer perceptron with Golang, How to build a simple artificial neural network with Go, Building a Neural Net from Scratch in Go -- Daniel Whitenack and others
- JavaScript / Java: Neural Networks - The Nature of Code [video] -- Daniel Shiffman
- JavaScript: Neural networks from scratch for JavaScript linguists (Part1 -- The Perceptron)
- Python: A Neural Network in 11 lines of Python -- iamtrask
- Python: Implement a Neural Network from Scratch -- Denny Britz
- Python: Optical Character Recognition (OCR)
- Python: Traffic signs classification with a convolutional network
- Python: Generate Music using LSTM Neural Network in Keras
- Python: An Introduction to Convolutional Neural Networks
- Python: Neural Networks: Zero to Hero -- Andrej Karpathy (recommended primary)
- Python: SlowTorch: Implementation of PyTorch from the ground up in 100% pure Python -- aliziaei/SlowTorch
6. Recommended primary path
Karpathy's "Neural Networks: Zero to Hero", Lecture 1 (micrograd) -> Lecture 2 (makemore: bigrams) -> Lecture 3 (MLP).
This is, by a wide margin, the best path. Why:
- Lecture 1 builds an autograd engine in 150 lines and trains a tiny MLP on it. You internalize backprop before you ever touch a tensor library.
- Lecture 2 builds a bigram character-level language model, first by counting and then as a single-layer network trained by gradient descent, so you also leave with a sense of what "language modeling" means.
- Lecture 3 generalizes to a proper MLP with cross-entropy loss.
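After these three lectures you will have implemented and understood, on a smaller scale, the training machinery GPT-class models rely on: automatic differentiation, matrix multiplies, nonlinearities, softmax cross-entropy, and gradient descent. Attention and normalization layers come in the later lectures.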
If you want to keep going: lectures on backprop, batch norm, WaveNet, and finally GPT from scratch in lecture 7.
7. Implementation milestones
Milestone 1: Scalar autograd ("micrograd")
Implement a Value class that wraps a Python float. Track operations as a DAG. Implement backward() via topological sort.
```python
import math


class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0                # dL/d(self), filled in by backward()
        self._backward = lambda: None  # leaves have nothing to propagate
        self._prev = set(_children)    # nodes this value was computed from
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            # d(a + b)/da = d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            # d(a * b)/da = b, d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')

        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order: every node appears after the nodes it depends on.
        topo = []
        visited = set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)

        # Seed dL/dL = 1, then walk from the output back to the leaves.
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()
```
Evidence: Reproduce one of Karpathy's gradient verifications by computing dL/dw numerically and analytically -- they must agree to 6 decimal places.
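One way to set up that comparison is sketched below on plain floats, so it runs without the engine. The toy loss `L = tanh(w * x + b)`, the helper names `loss_fn`, `analytic_grads`, and `numeric_grad`, and the step size `eps` are assumptions made for illustration, not Karpathy's exact verification.

```python
import math


def loss_fn(w, x, b):
    """Toy scalar loss: L = tanh(w * x + b)."""
    return math.tanh(w * x + b)


def analytic_grads(w, x, b):
    """Chain rule by hand: dL/dw = (1 - tanh^2) * x and dL/db = (1 - tanh^2)."""
    t = math.tanh(w * x + b)
    local = 1.0 - t * t
    return local * x, local


def numeric_grad(f, args, index, eps=1e-6):
    """Central difference of f with respect to args[index]."""
    plus, minus = list(args), list(args)
    plus[index] += eps
    minus[index] -= eps
    return (f(*plus) - f(*minus)) / (2 * eps)


args = (0.5, -1.5, 0.25)  # w, x, b
dw_analytic, db_analytic = analytic_grads(*args)
dw_numeric = numeric_grad(loss_fn, args, 0)
db_numeric = numeric_grad(loss_fn, args, 2)

# The two estimates should agree to about 6 decimal places.
assert abs(dw_analytic - dw_numeric) < 1e-6
assert abs(db_analytic - db_numeric) < 1e-6
```

To check the engine itself, build the same expression from `Value` objects, call `backward()`, and compare each `.grad` with the `numeric_grad` result.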
Milestone 2: Tiny neural network on scalars
Build Neuron, Layer, MLP classes that compose Value objects. Train on a 4-point binary classification problem (Karpathy's exact example).
```python
import random

# Value is the scalar autograd class from Milestone 1.


class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)

    def __call__(self, x):
        # Weighted sum of the inputs plus the bias, then a tanh nonlinearity.
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()


class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # A one-neuron layer returns a bare Value instead of a list.
        return outs[0] if len(outs) == 1 else outs


class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i + 1]) for i in range(len(nouts))]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```
Evidence: Loss curve over 200 epochs, dropping from ~5 to under 0.01.
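A training loop that produces such a loss curve might look like the sketch below. It repeats a compact `Value`, `Neuron`, and `MLP` so the block runs on its own. The `parameters()` helper, the four data points, the squared-error loss, the learning rate 0.02, and the 100 epochs are assumptions for illustration; tune them to hit the target above.

```python
import math
import random


class Value:
    """Compact copy of the Milestone 1 scalar autograd node."""
    def __init__(self, data, _children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
    def __call__(self, x):
        return sum((wi * xi for wi, xi in zip(self.w, x)), self.b).tanh()
    def parameters(self):  # hypothetical helper: all trainable Values
        return self.w + [self.b]


class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [[Neuron(sizes[i]) for _ in range(sizes[i + 1])]
                       for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = [n(x) for n in layer]
        return x[0] if len(x) == 1 else x
    def parameters(self):
        return [p for layer in self.layers for n in layer for p in n.parameters()]


random.seed(0)
xs = [[2.0, 3.0, -1.0], [3.0, -1.0, 0.5], [0.5, 1.0, 1.0], [1.0, 1.0, -1.0]]
ys = [1.0, -1.0, -1.0, 1.0]  # illustrative binary targets
model = MLP(3, [4, 4, 1])
losses = []
for epoch in range(100):
    loss = Value(0.0)
    for x, y in zip(xs, ys):
        diff = model(x) + (-y)    # prediction minus target
        loss = loss + diff * diff  # summed squared error
    for p in model.parameters():
        p.grad = 0.0               # zero before every backward pass
    loss.backward()
    for p in model.parameters():
        p.data -= 0.02 * p.grad    # plain gradient descent
    losses.append(loss.data)
```

Plotting `losses` against the epoch index gives the loss curve asked for as evidence.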
Milestone 3: Vectorize with NumPy
Scalar autograd is slow. Rewrite Value to operate on NumPy arrays. Now your gradients are matrix gradients.
This is the conceptual leap: the same chain rule applies. The shapes change.
```python
import numpy as np


class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)  # same shape as data
        ...

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, (self, other), '@')

        def _backward():
            # out.grad has the shape of out.data; each product below
            # restores the shape of the operand it updates.
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward = _backward
        return out
```
Evidence: Train a 2-layer MLP on a subset of MNIST (e.g., 1000 examples). Reach >85% accuracy in under a minute.
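Before training on MNIST, it may help to confirm the matmul backward rule numerically. The sketch below works on raw NumPy arrays rather than the `Tensor` class. The shapes, the random upstream gradient `dC`, and the helper `numeric_grad` are assumptions for illustration. It uses float64, since finite differences are unreliable in float32.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 2))
dC = rng.standard_normal((4, 2))  # stand-in for the upstream gradient dL/dC


def loss(A, B):
    # A scalar loss whose gradient with respect to C = A @ B is exactly dC.
    return float(np.sum((A @ B) * dC))


# Analytic rule used in Tensor.__matmul__.
dA = dC @ B.T  # shape (4, 3), same as A
dB = A.T @ dC  # shape (3, 2), same as B


def numeric_grad(f, M, eps=1e-6):
    """Central-difference gradient of f() with respect to each entry of M."""
    g = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        old = M[idx]
        M[idx] = old + eps
        plus = f()
        M[idx] = old - eps
        minus = f()
        M[idx] = old  # restore the entry
        g[idx] = (plus - minus) / (2 * eps)
    return g


dA_num = numeric_grad(lambda: loss(A, B), A)
dB_num = numeric_grad(lambda: loss(A, B), B)
assert np.allclose(dA, dA_num, atol=1e-6)
assert np.allclose(dB, dB_num, atol=1e-6)
```

Swapping the two products or dropping a transpose usually fails immediately with a shape error, which is why non-square test shapes are worth using.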
Milestone 4: Training loop, mini-batches, optimizer
Add: dataset loader, mini-batching, SGD with momentum, learning rate schedule, train/test split.
```python
for epoch in range(epochs):
    for x_batch, y_batch in batches(X_train, y_train, batch_size):
        logits = model(x_batch)
        loss = cross_entropy(logits, y_batch)
        # Reset gradients first: backward() accumulates into p.grad.
        for p in model.parameters():
            p.grad = np.zeros_like(p.data)
        loss.backward()
        # Plain SGD step; momentum and the schedule slot in here.
        for p in model.parameters():
            p.data -= lr * p.grad
```
Evidence: Train on full MNIST. Reach >95% test accuracy.
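The loop above calls a `batches` helper and lists momentum as a requirement without showing either. A minimal sketch of both follows. The optional `rng` argument, the function `sgd_momentum_step`, and the default hyperparameters are assumptions rather than part of the original design, and the update operates on raw NumPy arrays instead of `Tensor` objects.

```python
import numpy as np


def batches(X, y, batch_size, rng=None):
    """Yield shuffled (x_batch, y_batch) pairs that cover the data once."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


def sgd_momentum_step(params, grads, velocities, lr=0.1, momentum=0.9):
    """In-place update: v <- momentum * v - lr * g, then p <- p + v."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v -= lr * g
        p += v


# Every sample appears exactly once per epoch; the last batch may be smaller.
X = np.arange(10, dtype=np.float64).reshape(10, 1)
y = np.arange(10)
seen = sorted(int(i) for _, yb in batches(X, y, 4, np.random.default_rng(0)) for i in yb)

# Momentum on the toy loss 0.5 * ||p||^2, whose gradient is p itself.
p = np.array([1.0, -2.0])
v = np.zeros_like(p)
for _ in range(200):
    sgd_momentum_step([p], [p.copy()], [v])
```

With a `Tensor` class, `params` would be the `.data` arrays and `grads` the matching `.grad` arrays.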
Milestone 5: Convolution layer
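The hard one. Implement Conv2D forward and backward. Use im2col for the forward pass and its adjoint, col2im (a scatter-add of column gradients back into image positions), for the input gradient in the backward pass. Verify gradients with finite differences.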
Evidence: Train a small CNN (Conv -> ReLU -> Pool -> FC) on MNIST. Reach >98%.
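The im2col idea for the forward pass can be sketched as follows. This is a simplified illustration under stated assumptions: one image of shape (C, H, W), stride 1, no padding, no bias, and hypothetical function names. The backward pass is not shown.

```python
import numpy as np


def im2col(x, kh, kw):
    """Unfold x (C, H, W) into columns of shape (C*kh*kw, out_h*out_w)."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                # All input pixels that kernel tap (c, i, j) touches.
                cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols


def conv2d_forward(x, weight):
    """weight: (F, C, kh, kw). The convolution becomes one matrix multiply."""
    F, C, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = weight.reshape(F, -1) @ im2col(x, kh, kw)
    return out.reshape(F, out_h, out_w)


def conv2d_naive(x, weight):
    """Reference implementation with explicit loops, used only for checking."""
    F, C, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((F, out_h, out_w), dtype=x.dtype)
    for f in range(F):
        for a in range(out_h):
            for b in range(out_w):
                out[f, a, b] = np.sum(weight[f] * x[:, a:a + kh, b:b + kw])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 5))
weight = rng.standard_normal((2, 3, 3, 3))
out_fast = conv2d_forward(x, weight)
out_ref = conv2d_naive(x, weight)
assert np.allclose(out_fast, out_ref)
```

Keeping the naive version around as a reference gives the finite-difference gradient check a second, independent forward pass to compare against.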
Milestone 6 (optional, ambitious): Tiny transformer / GPT
This is the natural next project. See the dedicated LLM (GPT from scratch) tutorial -- it covers Karpathy's "Build GPT" lecture in full, with milestones from bigram -> attention -> multi-head -> transformer block -> Shakespeare-quality output. ~300 lines of Python + PyTorch.
Evidence: Generated samples after 5,000 training steps that look more like Shakespeare than random characters.
8. Tests & evidence
| Test | How |
|---|---|
| Gradient check | For every operation, compute dL/dx analytically and via finite difference (L(x+ε) − L(x−ε)) / (2ε). Must agree to ~1e-5. |
| Overfit a single batch | Loss should approach zero. If it doesn't, suspect the gradients first, then the learning rate. |
| Compare against PyTorch | Run the same input through both, compare outputs and gradients. |
| Reproducibility | Fixed seed -> identical loss curve across runs. |
| Memory / runtime | Track wall time per epoch and peak memory. Required for the writeup. |
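The reproducibility row can be turned into an automated test along these lines. The function `train_once` is a hypothetical stand-in for a real training run, and the tiny one-weight model is an assumption made to keep the sketch short. What matters is the pattern: a private seeded generator and a recorded loss list.

```python
import random


def train_once(seed, steps=20, lr=0.1):
    """Stand-in training run: fit y = w * x from a random initial w."""
    rng = random.Random(seed)  # private generator, no global state involved
    w = rng.uniform(-1.0, 1.0)
    data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5)]
    losses = []
    for _ in range(steps):
        rng.shuffle(data)  # shuffling also draws from the seeded generator
        for x, y in data:
            err = w * x - y
            losses.append(err * err)
            w -= lr * 2.0 * err * x
    return losses


run_a = train_once(seed=42)
run_b = train_once(seed=42)
run_c = train_once(seed=7)

assert run_a == run_b  # same seed: identical loss curve
assert run_a != run_c  # different seed: a different curve
```

In a NumPy-based project the same pattern applies with `np.random.default_rng(seed)` passed to initialization and shuffling.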
9. Common pitfalls
- Forgetting to zero gradients before each backward pass. Gradients accumulate across calls, so loops without `p.grad = 0` take ever-larger steps and typically diverge to NaN.
- In-place ops on Tensors that participate in the graph. Always create new Tensors for results.
- Wrong shapes in the backward of `@` (matmul). The transpose rule: `dA = dC @ B.T`, `dB = A.T @ dC`. Easy to swap.
- Vanishing/exploding gradients in deep `tanh` or `sigmoid` nets. Use ReLU and proper init (He/Glorot).
- Wrong order in the topological sort. Backward must process a node only after every node that consumes its output, so iterate the forward topological order in reverse: the loss first, the inputs last.
- Computing softmax + cross-entropy separately. Combine them into a single op for numerical stability (subtract the max logit).
- Treating learning rate as a constant. It is the single most important hyperparameter. Start with `1e-3` for Adam, `0.1` for SGD on small nets.
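The combined softmax and cross-entropy op could be sketched like this for a single example. The function name and the list-based interface are assumptions. A vectorized version would apply the same max-subtraction per row.

```python
import math


def softmax_cross_entropy(logits, target):
    """Return (loss, dloss_dlogits) for one example.

    loss = -log softmax(logits)[target], computed with the max logit
    subtracted so that exp() never sees a large positive argument.
    """
    m = max(logits)
    shifted = [z - m for z in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    loss = log_sum - shifted[target]
    probs = [math.exp(s - log_sum) for s in shifted]
    # Gradient of the fused op: softmax probabilities minus the one-hot target.
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad


# math.exp(1000.0) raises OverflowError, so a naive softmax fails on these logits.
loss, grad = softmax_cross_entropy([1000.0, 1001.0, 999.0], target=1)
small_loss, _ = softmax_cross_entropy([0.0, 1.0, -1.0], target=1)
```

Shifting every logit by the same constant leaves the loss unchanged, which is why `loss` and `small_loss` agree here.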
10. Extensions
- Batch normalization -- Karpathy covers it; the gradient formula is fiddly but illuminating.
- Adam optimizer -- keep running estimates of first and second moments of gradients.
- Recurrent network (RNN, LSTM) -- backprop through time.
- Transformer / self-attention -- the modern building block. See the dedicated LLM tutorial.
- GPU support via CuPy -- drop-in NumPy replacement for `Tensor.data`.
- Save/load model weights -- `np.savez` is enough.
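As a starting point for the Adam extension, here is a minimal sketch that keeps the two running moment estimates for a flat list of float parameters. The class layout and the toy objective (x − 3)² are assumptions for illustration. The default hyperparameters follow the values commonly quoted for Adam.

```python
import math


class Adam:
    """Adam for a flat list of float parameters."""

    def __init__(self, n_params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * n_params  # running mean of gradients (first moment)
        self.v = [0.0] * n_params  # running mean of squared gradients (second moment)
        self.t = 0                 # step counter for bias correction

    def step(self, params, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            # Both estimates start at zero, so correct their bias toward zero.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


# Minimize (x - 3)^2 starting from x = 0.
params = [0.0]
opt = Adam(n_params=1, lr=0.1)
for _ in range(500):
    grads = [2.0 * (params[0] - 3.0)]
    opt.step(params, grads)
```

For the tensor engine, the same update applies element-wise with `m` and `v` stored as arrays shaped like each parameter.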
11. Module integration
| Module | What the neural network deepens |
|---|---|
| Sem 1 Module 4 -- Linear algebra | Forward pass is dense matmul; backward is A.T @ ∂L/∂Y. Internalizes matrix shapes. |
| Sem 1 Module 5 -- Statistics | Cross-entropy = negative log-likelihood. Connects ML loss to statistical inference. |
| Sem 2 Module 4 -- Dynamic programming | Backprop is DP over a DAG. The backward function is a memoized recursion. |
| Sem 2 Module 5 -- Advanced structures | Topological sort of the computational graph. Hash set for visited nodes. |
| LLM tutorial | Direct continuation. Use this autograd engine to build a transformer. |
| 3D Renderer tutorial | Sibling project: same Sem 1 math (linear algebra + probability), different domain. |
12. Portfolio framing
What to publish:
- The autograd engine (clearly separated from the training code).
- A "from scratch -> PyTorch" comparison notebook: same model, same data, your code vs `torch.nn`.
- One MNIST result with a confusion matrix.
What to keep private:
- Datasets or pre-trained weights that you don't have a license to redistribute.
- Failed gradient-check sessions (keep them in notes -- they show learning).
Reviewer entry points:
- README: start at `engine.py`, then `nn.py`, then `train_mnist.py`.
- Call out the gradient-check test as the primary correctness evidence.
13. Local source backbone
Use Build Your Own Neural Networks: Step-by-Step (build-your-own/neural-networks-kilho-shin) as the slower, beginner-friendly companion to the Karpathy path. Do not treat it as a replacement for building the autograd engine; use it to make every math and training decision explicit.
| Local chunks | Use them for | Curriculum insertion |
|---|---|---|
| 002-004 | Neural-network vocabulary, applications, limitations, and model families | Before Milestone 1, write a one-page "what this model can and cannot learn" note. |
| 006-010 | Python, NumPy arrays, broadcasting, and vectorized operations | Add a NumPy warmup lab before implementing Value and tensor operations. |
| 011-014 | Activation functions, weights, forward pass, backpropagation | Expand the gradient-check milestone with one hand-worked forward/backward example. |
| 015-018 | Training loop, validation split, debugging, and bias/variance | Add training diagnostics: loss curves, overfit-one-batch evidence, and error analysis. |
| 019-021 | Regularization, dropout, transfer learning, hyperparameters | Add a tuning checkpoint after MNIST: compare baseline, L2/dropout, and learning-rate schedules. |
| 022-028 | CNNs, pooling, LSTMs, preprocessing, augmentation | Use as optional extensions after the core MLP/autograd project is stable. |
| 029-032 | Deployment, research habits, bias, privacy, AI ethics | Add a final model card and data-risk note to the portfolio artifact. |
Extra checkpoints from the book chunks
- Vectorization checkpoint: rewrite a scalar forward pass as batched NumPy, then prove the output is unchanged.
- Backprop checkpoint: compute one two-layer network gradient by hand, by finite difference, and by your engine.
- Generalization checkpoint: train the same model under underfit, fit, and overfit settings; explain the curves.
- Responsible-use checkpoint: document the dataset source, label risks, privacy assumptions, and known bias.
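The vectorization checkpoint might be set up as in the sketch below: one tanh layer written with Python loops, the same layer as a batched NumPy expression, and a comparison of the two. The function names, shapes, and random inputs are assumptions for illustration. "Unchanged" here means equal up to floating-point rounding.

```python
import math

import numpy as np


def forward_scalar(x, W, b):
    """One tanh layer for a single example, with explicit loops. W: (n_in, n_out)."""
    out = []
    for j in range(len(b)):
        act = b[j]
        for i in range(len(x)):
            act += x[i] * W[i][j]
        out.append(math.tanh(act))
    return out


def forward_batched(X, W, b):
    """The same layer for a whole batch: X (batch, n_in) @ W (n_in, n_out) + b."""
    return np.tanh(X @ W + b)  # b broadcasts across the batch dimension


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((3, 4))
b = rng.standard_normal(4)

scalar_out = np.array([forward_scalar(x, W, b) for x in X])
batched_out = forward_batched(X, W, b)
max_diff = float(np.max(np.abs(scalar_out - batched_out)))
assert max_diff < 1e-12
```

The same scalar-versus-batched comparison doubles as the "scalar vs vector equivalence" test for the tensor path.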
14. Deep project spec
Project contract
Build a neural-network implementation that makes the math inspectable. The minimum contract is scalar autograd, vectorized tensor operations where appropriate, a small MLP, gradient checking, a training loop, validation metrics, and one real dataset experiment. CNNs, RNNs, dropout, and deployment are extensions.
Source-backed reading map
| Source ID | Use for | Required output |
|---|---|---|
| build-your-own/neural-networks-kilho-shin | model vocabulary, NumPy/vectorization, activations, backprop, training diagnostics, regularization | math notes, gradient checks, training report |
Milestone map
| Milestone | Deliverable | Tests | Failure case |
|---|---|---|---|
| Scalar autograd | Value graph and backward pass | derivative fixtures | wrong topological order |
| Tensor/vector path | NumPy batched operations | scalar vs vector equivalence | broadcasting bug |
| MLP | layers, activations, loss | overfit tiny dataset | dead activation or bad init |
| Gradient check | finite-difference comparison | per-parameter tolerance tests | gradient sign error |
| Training loop | optimizer, train/val split, metrics | loss decreases on controlled task | train/val leakage |
| MNIST or equivalent | real dataset experiment | accuracy/loss report | overfit/underfit diagnosis |
| Regularization extension | L2/dropout/augmentation | baseline comparison | regularization worsens result without explanation |
Test matrix
| Test type | Required examples |
|---|---|
| Unit | activations, loss functions, tensor shapes |
| Numerical | finite-difference gradient checks |
| Golden | one hand-worked forward/backward example |
| Experiment | fixed-seed training run with saved config |
| Failure analysis | underfit, fit, and overfit comparison |
Design notes required
- `math.md`: forward equations, backward equations, tensor shapes.
- `training.md`: optimizer, initialization, batch size, learning rate, stopping rule.
- `experiment.md`: dataset, split, metrics, seed, hardware, and interpretation.
- `model-card.md`: intended use, limitations, data risks, and bias/privacy notes.
Portfolio evidence
Publish gradient-check output, a loss curve, one confusion/error analysis table, fixed-seed config, and a short explanation of the model's limits.
Source
This tutorial draws from the BYO-X catalog entries for "Neural Network" and "AI Model". Karpathy's "Neural Networks: Zero to Hero" is the strongly recommended primary path.