Build Your Own Neural Network from Scratch

"I would never trust an idea I couldn't write the gradient for." -- Andrej Karpathy (paraphrased)

Implementing a neural network from scratch -- no PyTorch, no autograd library -- is the single best way to internalize what "backpropagation" actually computes. It turns ML from a black box into a few hundred lines of Python you understand line by line.


1. Overview & motivation

A neural network is a parameterized function f(x; θ) that you train by gradient descent on a loss L(f(x), y). Every modern ML framework is, under the hood, a system that:

  1. Lets you compose differentiable operations into a computational graph.
  2. Runs forward to compute the output.
  3. Runs backward to compute ∂L/∂θ for every parameter.
  4. Updates θ ← θ − α ∂L/∂θ.
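
The four steps above can be sketched on the smallest possible model: a single weight w fit to data from y = 3x, with the gradient written by hand. This is a minimal sketch, not a framework; the data, the mean-squared-error loss, and the learning rate alpha = 0.1 are all assumptions chosen for illustration.

```python
# Fit f(x; w) = w * x to data generated by y = 3x using plain gradient descent.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w = 0.0        # the single parameter (theta in the text)
alpha = 0.1    # learning rate; an assumed value that suits this toy problem


def loss_and_grad(w):
    n = len(xs)
    # Forward: mean squared error L = (1/n) * sum((w*x - y)^2)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    # Backward: dL/dw = (2/n) * sum((w*x - y) * x), from the chain rule
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


for step in range(50):
    loss, grad = loss_and_grad(w)
    w -= alpha * grad      # the update rule: theta <- theta - alpha * dL/dtheta
```

An autograd engine automates only the "Backward" line: instead of deriving dL/dw by hand, each operation records its local derivative and the engine chains them together.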

What you can only learn by building one:

  • Why automatic differentiation is easy in principle (chain rule applied locally).
  • Why batch matrix multiply is the dominant cost of neural-network training.
  • Why activations like ReLU exist (gradient flow), and why deep nets needed innovations like residual connections.
  • Why the same gradient code that works for one neuron also works for GPT-class models -- reverse-mode autodiff costs a small constant factor on top of the forward pass, so it scales linearly with the number of operations.

2. Where this fits in the degree

  • Phase: Foundations
  • Semester: 1 (Linear algebra, statistics) -> 2 (Algorithms, DP)
  • Modules deepened: Sem 1 Module 4 (linear algebra foundations), Sem 1 Module 5 (statistics & inference), Sem 2 Module 4 (DP -- backprop is literally dynamic programming over a computational graph)

Cross-phase relevance:

  • Useful background for any modern data engineering, search ranking, or recommendation work
  • Connects directly to the "AI Model" category in BYO-X (Karpathy's "Build GPT from Scratch")

3. Prerequisites

  • Linear algebra: matrix multiplication and transpose.
  • Python: NumPy. You should be able to write a matrix multiply with @.
  • Calculus: partial derivatives and the chain rule. You should be comfortable with dz/dx notation. That's it. No measure theory required.

If linear algebra feels shaky, finish Sem 1 Module 4 first.


4. Theory & research

Required reading

  • Andrej Karpathy, "Neural Networks: Zero to Hero" (karpathy.ai/zero-to-hero.html) -- video series. The single best resource for this project. Start with the micrograd video, then makemore.
  • Michael Nielsen, Neural Networks and Deep Learning (free online) -- Chapters 1-3 cover backprop and gradient descent at the right depth.
  • Goodfellow, Bengio, Courville, Deep Learning -- Chapter 6 (feedforward networks), Chapter 8 (optimization). The free PDF is at deeplearningbook.org.
  • 3Blue1Brown's neural network series (youtube.com/@3blue1brown) -- for visual intuition.

For the autograd part specifically

Historical / foundational papers

  • Rumelhart, Hinton, Williams (1986), "Learning representations by back-propagating errors" -- Nature paper that brought backprop into modern ML.
  • LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition" -- the LeNet-5 paper that established convolutional networks for document recognition (LeCun's first CNN work dates to 1989).

5. Curated tutorial list (from BYO-X)

  • C#: Neural Network OCR -- Andrew Kirillov
  • F#: Building Neural Networks in F#: Part 1 and Part 2 -- Mathias Brandewinder
  • Go: Build a multilayer perceptron with Golang, How to build a simple artificial neural network with Go, Building a Neural Net from Scratch in Go -- Daniel Whitenack and others
  • JavaScript / Java: Neural Networks - The Nature of Code [video] -- Daniel Shiffman
  • JavaScript: Neural networks from scratch for JavaScript linguists (Part1 -- The Perceptron)
  • Python: A Neural Network in 11 lines of Python -- iamtrask
  • Python: Implement a Neural Network from Scratch -- Denny Britz
  • Python: Optical Character Recognition (OCR)
  • Python: Traffic signs classification with a convolutional network
  • Python: Generate Music using LSTM Neural Network in Keras
  • Python: An Introduction to Convolutional Neural Networks
  • Python: Neural Networks: Zero to Hero -- Andrej Karpathy ⭐ recommended primary
  • Python: SlowTorch: Implementation of PyTorch from the ground up in 100% pure Python -- aliziaei/SlowTorch

6. Recommended path

Karpathy's "Neural Networks: Zero to Hero": Lecture 1 (micrograd) -> Lecture 2 (makemore: bigrams) -> Lecture 3 (MLP).

This is, by a wide margin, the best path. Why:

  • Lecture 1 builds an autograd engine in 150 lines and trains a tiny MLP on it. You internalize backprop before you ever touch a tensor library.
  • Lecture 2 builds a bigram character-level language model, so you also leave with a sense of what "language modeling" means.
  • Lecture 3 generalizes to a proper MLP with cross-entropy loss.

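After these three lectures you will have implemented and understood the core training machinery GPT relies on (backprop through matrix multiplies, embeddings, nonlinearities, and softmax cross-entropy), at a much smaller scale. Attention itself comes later in the series.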

If you want to keep going: the later lectures cover batch norm, manual backprop, WaveNet, and finally GPT from scratch in lecture 7.


7. Implementation milestones

Milestone 1: Scalar autograd ("micrograd")

Implement a Value class that wraps a Python float. Track operations as a DAG. Implement backward() via topological sort.

import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0                  # dL/d(self), filled in by backward()
        self._backward = lambda: None    # leaf nodes have nothing to propagate
        self._prev = set(_children)      # the Values this node was computed from
        self._op = _op                   # label of the op that produced this node

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # d(a + b)/da = d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # d(a * b)/da = b, d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Depth-first topological sort: a node is appended after all of its inputs.
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0                  # dL/dL = 1
        # Walk from the output back toward the leaves.
        for v in reversed(topo):
            v._backward()

Evidence: Reproduce one of Karpathy's gradient verifications by computing dL/dw numerically and analytically -- they must agree to 6 decimal places.
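
The numerical-versus-analytical comparison can be sketched without any engine at all. The block below is a minimal illustration, assuming a single scalar "neuron" L = tanh(w*x + b); the function names, the test point, and the step eps = 1e-6 are hypothetical choices. Once Value works, the analytic side would come from your engine's w.grad instead.

```python
import math


def f(w, x, b):
    # Scalar "neuron": L = tanh(w*x + b)
    return math.tanh(w * x + b)


def analytic_dL_dw(w, x, b):
    # Chain rule: dL/dw = (1 - tanh(z)^2) * x, with z = w*x + b
    t = math.tanh(w * x + b)
    return (1.0 - t * t) * x


def numeric_dL_dw(w, x, b, eps=1e-6):
    # Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
    return (f(w + eps, x, b) - f(w - eps, x, b)) / (2.0 * eps)


w, x, b = 0.5, 2.0, -0.3          # arbitrary test point (an assumption)
analytic = analytic_dL_dw(w, x, b)
numeric = numeric_dL_dw(w, x, b)
difference = abs(analytic - numeric)
```

The central difference is used rather than the one-sided (L(w + eps) - L(w)) / eps because its truncation error shrinks with eps squared instead of eps.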

Milestone 2: Tiny neural network on scalars

Build Neuron, Layer, MLP classes that compose Value objects. Train on a 4-point binary classification problem (Karpathy's exact example).

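import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
    def __call__(self, x):
        # w . x + b; starting sum() at self.b keeps every intermediate a Value
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
    def parameters(self):
        return self.w + [self.b]

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        # Unwrap single-output layers so the final result is a plain Value
        return outs[0] if len(outs) == 1 else outs
    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

class MLP:
    def __init__(self, nin, nouts):
        # nouts is a list of layer widths, e.g. MLP(3, [4, 4, 1])
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    def parameters(self):
        # Flat list of every weight and bias; the training loop iterates over this
        return [p for layer in self.layers for p in layer.parameters()]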

Evidence: Loss curve over 200 epochs, dropping from ~5 to under 0.01.

Milestone 3: Vectorize with NumPy

Scalar autograd is slow. Rewrite Value to operate on NumPy arrays. Now your gradients are matrix gradients.

This is the conceptual leap: the same chain rule applies. The shapes change.

import numpy as np

class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)    # same shape as data
        ...

    def __matmul__(self, other):
        # Forward: C = A @ B
        out = Tensor(self.data @ other.data, (self, other), '@')
        def _backward():
            # dA = dC @ B.T has A's shape; dB = A.T @ dC has B's shape
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward = _backward
        return out

Evidence: Train a 2-layer MLP on a subset of MNIST (e.g., 1000 examples). Reach >85% accuracy in under a minute.
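
Before training anything, it helps to confirm the matmul backward rule in isolation. The sketch below checks dA = dC @ B.T and dB = A.T @ dC against central finite differences. The shapes, the seed, and the helper name numeric_grad are assumptions for illustration, and the check runs in float64 (unlike the float32 Tensor above) so rounding noise stays small.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # e.g. a batch of 4 inputs with 3 features
B = rng.standard_normal((3, 2))   # e.g. a weight matrix
G = rng.standard_normal((4, 2))   # stand-in for the upstream gradient dL/dC


def loss(A, B):
    # Scalar loss chosen so that dL/dC equals G exactly, where C = A @ B.
    return float(np.sum((A @ B) * G))


# Analytic gradients from the matmul backward rule.
dA = G @ B.T      # shape (4, 3), same as A
dB = A.T @ G      # shape (3, 2), same as B


def numeric_grad(f, X, eps=1e-6):
    # Central finite difference, perturbing one entry of X at a time (in place).
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        up = f()
        X[idx] = old - eps
        down = f()
        X[idx] = old                      # restore the original entry
        grad[idx] = (up - down) / (2.0 * eps)
    return grad


dA_num = numeric_grad(lambda: loss(A, B), A)
dB_num = numeric_grad(lambda: loss(A, B), B)
```

A quick sanity rule that falls out of this: every gradient must have exactly the shape of the array it belongs to, which is usually enough to tell which operand needs the transpose.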

Milestone 4: Training loop, mini-batches, optimizer

Add: dataset loader, mini-batching, SGD with momentum, learning rate schedule, train/test split.

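# Assumes model, cross_entropy, batches, X_train, y_train, batch_size, epochs,
# and lr are defined elsewhere, and that NumPy is imported as np.
for epoch in range(epochs):
    for x_batch, y_batch in batches(X_train, y_train, batch_size):
        logits = model(x_batch)
        loss = cross_entropy(logits, y_batch)
        # Zero the gradients first: backward() accumulates with +=.
        for p in model.parameters():
            p.grad = np.zeros_like(p.data)
        loss.backward()
        # Plain SGD step; momentum and a learning rate schedule build on this.
        for p in model.parameters():
            p.data -= lr * p.grad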

Evidence: Train on full MNIST. Reach >95% test accuracy.
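
The loop above shows plain SGD and leaves batches undefined. One way to fill in both pieces is sketched below: a shuffling mini-batch generator and a classical momentum optimizer. The class name SGDMomentum, its step(grads) interface, and the hyperparameters are assumptions for this sketch, and the toy linear-regression problem with a hand-written gradient stands in for a real model.

```python
import numpy as np


def batches(X, y, batch_size, rng):
    # Yield shuffled mini-batches; the last batch may be smaller.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


class SGDMomentum:
    # Classical momentum: v <- mu * v - lr * grad, then p <- p + v
    def __init__(self, params, lr=0.1, mu=0.9):
        self.params = params                       # NumPy arrays, updated in place
        self.lr, self.mu = lr, mu
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for p, v, g in zip(self.params, self.velocity, grads):
            v *= self.mu
            v -= self.lr * g
            p += v


# Toy usage: noiseless linear regression y = X @ w_true.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
opt = SGDMomentum([w], lr=0.05, mu=0.9)
for epoch in range(50):
    for xb, yb in batches(X, y, batch_size=32, rng=rng):
        grad_w = 2.0 * xb.T @ (xb @ w - yb) / len(xb)   # d(MSE)/dw
        opt.step([grad_w])
```

Keeping the optimizer state (velocity) outside the parameters means the same training loop works unchanged when you later swap in a different optimizer.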

Milestone 5: Convolution layer

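The hard one. Implement Conv2D forward and backward. Use im2col to turn the forward pass into a single matrix multiply; in the backward pass, apply the transposed matmul rule and scatter the column gradients back onto the input (col2im). Verify gradients with finite differences.

Evidence: Train a small CNN (Conv -> ReLU -> Pool -> FC) on MNIST. Reach >98% test accuracy.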
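
To make the im2col idea concrete, here is a forward-only sketch checked against a direct loop implementation. It assumes stride 1, no padding, and NCHW layout; the function names and shapes are illustrative rather than taken from any library, and the backward pass is deliberately left out.

```python
import numpy as np


def im2col(x, kh, kw):
    # x: (N, C, H, W) -> patches of shape (N, out_h, out_w, C*kh*kw)
    N, C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((N, out_h, out_w, C * kh * kw), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, :, i:i + kh, j:j + kw]          # (N, C, kh, kw)
            cols[:, i, j, :] = patch.reshape(N, -1)      # flatten each patch
    return cols


def conv2d_forward(x, w, b):
    # w: (F, C, kh, kw), b: (F,) -> output (N, F, out_h, out_w)
    F, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)
    out = cols @ w.reshape(F, -1).T + b                  # one matmul per batch
    return out.transpose(0, 3, 1, 2)


def conv2d_naive(x, w, b):
    # Reference implementation with explicit loops (cross-correlation).
    N, C, H, W = x.shape
    F, _, kh, kw = w.shape
    out = np.zeros((N, F, H - kh + 1, W - kw + 1))
    for n in range(N):
        for f in range(F):
            for i in range(H - kh + 1):
                for j in range(W - kw + 1):
                    window = x[n, :, i:i + kh, j:j + kw]
                    out[n, f, i, j] = np.sum(window * w[f]) + b[f]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
b = rng.standard_normal(4)
out_fast = conv2d_forward(x, w, b)
out_slow = conv2d_naive(x, w, b)
```

Because the forward pass is now "gather patches, then matmul", the backward pass reuses the matmul rule from Milestone 3, followed by the reverse of the gather: each column gradient is added back to the input positions its patch came from.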

Milestone 6 (optional, ambitious): Tiny transformer / GPT

This is the natural next project. See the dedicated LLM (GPT from scratch) tutorial -- it covers Karpathy's "Build GPT" lecture in full, with milestones from bigram -> attention -> multi-head -> transformer block -> Shakespeare-quality output. ~300 lines of Python + PyTorch.

Evidence: Generated samples after 5,000 training steps that look more like Shakespeare than random characters.


8. Tests & evidence

Test | How
Gradient check | For every operation, compute dL/dx analytically and via finite difference (L(x+ε) − L(x−ε)) / (2ε). Must agree to ~1e-5.
Overfit a single batch | Loss should approach zero. If it doesn't, your gradients are wrong.
Compare against PyTorch | Run the same input through both, compare outputs and gradients.
Reproducibility | Fixed seed -> identical loss curve across runs.
Memory / runtime | Track wall time per epoch and peak memory. Required for the writeup.
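
The reproducibility row can be turned into an automated check. The sketch below assumes all randomness (data, initialization) flows from a single seeded NumPy generator; the train function and its toy logistic-regression task are hypothetical stand-ins for your own training script.

```python
import numpy as np


def train(seed, steps=20, lr=0.1):
    # Every random draw comes from this one generator, so the seed fixes the run.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((64, 5))
    y = (X[:, 0] > 0).astype(np.float64)       # toy binary labels
    w = rng.standard_normal(5) * 0.01          # small random init
    losses = []
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))     # sigmoid
        loss = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        w -= lr * X.T @ (p - y) / len(X)       # gradient of mean cross-entropy
        losses.append(float(loss))
    return losses


run_a = train(seed=0)
run_b = train(seed=0)     # same seed: the loss curve should match exactly
run_c = train(seed=1)     # different seed: a different curve
```

If two same-seed runs diverge, look for randomness that bypasses the generator, such as a call to the global random module or a shuffle inside the data loader.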

9. Common pitfalls

  • Forgetting to zero gradients before each backward pass. Gradients accumulate via +=, so without resetting p.grad each update applies the running sum of all past gradients. The loss typically stalls or diverges, often to NaN.
  • In-place ops on Tensors that participate in the graph. Always create new Tensors for results.
  • Wrong shapes in the backward of @ (matmul). The transpose rule: dA = dC @ B.T, dB = A.T @ dC. Easy to swap.
  • Vanishing/exploding gradients with tanh or sigmoid deep nets. Use ReLU and proper init (He/Glorot).
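  • Wrong node order in the backward pass. A node's _backward() must run only after every node that consumes its output has run, so that its grad is complete. Build the topological order from inputs to output (each node appended after its _prev entries), then iterate it in reverse, starting from the loss.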
  • Computing softmax + cross-entropy separately. Combine them into a single op for numerical stability (subtract max logit).
  • Treating learning rate as a constant. It is the single most important hyperparameter. Start with 1e-3 for Adam, 0.1 for SGD on small nets.
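
For the softmax + cross-entropy pitfall above, a fused version might look like the following sketch. It assumes integer class targets and a mean reduction over the batch; the function name and the extreme test logits are illustrative. The gradient uses the standard result that dL/dlogits = (softmax − one_hot) / N.

```python
import numpy as np


def softmax_cross_entropy(logits, targets):
    # logits: (N, K) floats; targets: (N,) integer class ids.
    # Returns the mean loss and dL/dlogits.
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)    # subtract max logit
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(n), targets].mean()
    grad = np.exp(log_probs)                 # softmax probabilities
    grad[np.arange(n), targets] -= 1.0       # subtract the one-hot targets
    grad /= n                                # mean reduction
    return loss, grad


# Logits this large would overflow a naive exp(logits) / sum(exp(logits)).
logits = np.array([[1000.0, 0.0, -1000.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([0, 2])
loss, grad = softmax_cross_entropy(logits, targets)
```

Subtracting the row maximum does not change the softmax output, since the shift cancels between numerator and denominator, but it guarantees that the largest exponent is exp(0) = 1.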

10. Extensions

  • Batch normalization -- Karpathy covers it; the gradient formula is fiddly but illuminating.
  • Adam optimizer -- keep running estimates of first and second moments of gradients.
  • Recurrent network (RNN, LSTM) -- backprop through time.
  • Transformer / self-attention -- the modern building block. See the dedicated LLM tutorial.
  • GPU support via CuPy -- drop-in NumPy replacement for Tensor.data.
  • Save/load model weights -- np.savez is enough.
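
As a starting point for the Adam extension, the update rule can be sketched as below. The class layout and the step(grads) interface are assumptions made for this example; the default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly used ones, and the toy objective ||w||^2 exists only to exercise the code.

```python
import numpy as np


class Adam:
    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params                          # NumPy arrays, updated in place
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]   # first-moment estimates
        self.v = [np.zeros_like(p) for p in params]   # second-moment estimates
        self.t = 0                                    # step counter for bias correction

    def step(self, grads):
        self.t += 1
        for p, m, v, g in zip(self.params, self.m, self.v, grads):
            m *= self.beta1
            m += (1.0 - self.beta1) * g
            v *= self.beta2
            v += (1.0 - self.beta2) * g * g
            # Bias correction: m and v start at zero and are biased toward it.
            m_hat = m / (1.0 - self.beta1 ** self.t)
            v_hat = v / (1.0 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


# Toy usage: minimize ||w||^2, whose gradient is 2 * w.
w = np.array([5.0, -3.0])
opt = Adam([w], lr=0.1)
for _ in range(500):
    opt.step([2.0 * w])
```

Note the learning rate of 0.1 here is chosen for the toy problem; for real networks the pitfalls section's suggestion of 1e-3 is the usual starting point.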

11. Module integration

Module | What the neural network deepens
Sem 1 Module 4 -- Linear algebra | Forward pass is dense matmul; backward is A.T @ ∂L/∂Y. Internalizes matrix shapes.
Sem 1 Module 5 -- Statistics | Cross-entropy = negative log-likelihood. Connects ML loss to statistical inference.
Sem 2 Module 4 -- Dynamic programming | Backprop is DP over a DAG. The backward function is a memoized recursion.
Sem 2 Module 5 -- Advanced structures | Topological sort of the computational graph. Hash set for visited nodes.
LLM tutorial | Direct continuation. Use this autograd engine to build a transformer.
3D Renderer tutorial | Sibling project: same Sem 1 math (linear algebra + probability), different domain.

12. Portfolio framing

What to publish:

  • The autograd engine (clearly separated from the training code).
  • A "from scratch -> PyTorch" comparison notebook: same model, same data, your code vs torch.nn.
  • One MNIST result with a confusion matrix.

What to keep private:

  • Datasets or pre-trained weights that you don't have a license to redistribute.
  • Failed gradient-check sessions (keep them in notes -- they show learning).

Reviewer entry points:

  • README: start at engine.py, then nn.py, then train_mnist.py.
  • Call out the gradient-check test as the primary correctness evidence.

13. Local source backbone

Use Build Your Own Neural Networks: Step-by-Step (build-your-own/neural-networks-kilho-shin) as the slower, beginner-friendly companion to the Karpathy path. Do not treat it as a replacement for building the autograd engine; use it to make every math and training decision explicit.

Local chunks | Use them for | Curriculum insertion
002-004 | Neural-network vocabulary, applications, limitations, and model families | Before Milestone 1, write a one-page "what this model can and cannot learn" note.
006-010 | Python, NumPy arrays, broadcasting, and vectorized operations | Add a NumPy warmup lab before implementing Value and tensor operations.
011-014 | Activation functions, weights, forward pass, backpropagation | Expand the gradient-check milestone with one hand-worked forward/backward example.
015-018 | Training loop, validation split, debugging, and bias/variance | Add training diagnostics: loss curves, overfit-one-batch evidence, and error analysis.
019-021 | Regularization, dropout, transfer learning, hyperparameters | Add a tuning checkpoint after MNIST: compare baseline, L2/dropout, and learning-rate schedules.
022-028 | CNNs, pooling, LSTMs, preprocessing, augmentation | Use as optional extensions after the core MLP/autograd project is stable.
029-032 | Deployment, research habits, bias, privacy, AI ethics | Add a final model card and data-risk note to the portfolio artifact.

Extra checkpoints from the book chunks

  1. Vectorization checkpoint: rewrite a scalar forward pass as batched NumPy, then prove the output is unchanged.
  2. Backprop checkpoint: compute one two-layer network gradient by hand, by finite difference, and by your engine.
  3. Generalization checkpoint: train the same model under underfit, fit, and overfit settings; explain the curves.
  4. Responsible-use checkpoint: document the dataset source, label risks, privacy assumptions, and known bias.
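
The vectorization checkpoint can be set up along these lines: compute one tanh layer with pure-Python scalar loops, then with a single batched NumPy expression, and compare. The layer sizes, the seed, and the function names are assumptions for this sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))    # 3 inputs -> 4 neurons
b = rng.standard_normal(4)
X = rng.standard_normal((5, 3))    # batch of 5 examples


def forward_scalar(x):
    # One example, one neuron at a time: tanh(sum_i x_i * W_ij + b_j)
    return [
        math.tanh(sum(x[i] * W[i, j] for i in range(3)) + b[j])
        for j in range(4)
    ]


def forward_batched(X):
    # The same layer for the whole batch: one matmul plus a broadcast add.
    return np.tanh(X @ W + b)


scalar_out = np.array([forward_scalar(x) for x in X])
batched_out = forward_batched(X)
max_abs_diff = float(np.max(np.abs(scalar_out - batched_out)))
```

The two paths sum in a different order, so expect agreement up to floating-point rounding rather than bit-for-bit equality; compare with a tolerance such as np.allclose.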

14. Deep project spec

Project contract

Build a neural-network implementation that makes the math inspectable. The minimum contract is scalar autograd, vectorized tensor operations where appropriate, a small MLP, gradient checking, a training loop, validation metrics, and one real dataset experiment. CNNs, RNNs, dropout, and deployment are extensions.

Source-backed reading map

Source ID | Use for | Required output
build-your-own/neural-networks-kilho-shin | model vocabulary, NumPy/vectorization, activations, backprop, training diagnostics, regularization | math notes, gradient checks, training report

Milestone map

Milestone | Deliverable | Tests | Failure case
Scalar autograd | Value graph and backward pass | derivative fixtures | wrong topological order
Tensor/vector path | NumPy batched operations | scalar vs vector equivalence | broadcasting bug
MLP | layers, activations, loss | overfit tiny dataset | dead activation or bad init
Gradient check | finite-difference comparison | per-parameter tolerance tests | gradient sign error
Training loop | optimizer, train/val split, metrics | loss decreases on controlled task | train/val leakage
MNIST or equivalent | real dataset experiment | accuracy/loss report | overfit/underfit diagnosis
Regularization extension | L2/dropout/augmentation | baseline comparison | regularization worsens result without explanation

Test matrix

Test type | Required examples
Unit | activations, loss functions, tensor shapes
Numerical | finite-difference gradient checks
Golden | one hand-worked forward/backward example
Experiment | fixed-seed training run with saved config
Failure analysis | underfit, fit, and overfit comparison

Design notes required

  • math.md: forward equations, backward equations, tensor shapes.
  • training.md: optimizer, initialization, batch size, learning rate, stopping rule.
  • experiment.md: dataset, split, metrics, seed, hardware, and interpretation.
  • model-card.md: intended use, limitations, data risks, and bias/privacy notes.

Portfolio evidence

Publish gradient-check output, a loss curve, one confusion/error analysis table, fixed-seed config, and a short explanation of the model's limits.


Source

This tutorial draws from the BYO-X catalog entries for "Neural Network" and "AI Model". Karpathy's "Neural Networks: Zero to Hero" is the strongly recommended primary path.