Build Your Own Neural Network from Scratch

"I would never trust an idea I couldn't write the gradient for." -- Andrej Karpathy (paraphrased)

Implementing a neural network from scratch -- no PyTorch, no autograd library -- is the single best way to internalize what "backpropagation" actually computes. It turns ML from a black box into a few hundred lines of Python you understand line by line.

1. Overview & motivation

A neural network is a parameterized function f(x; θ) that you train by gradient descent on a loss L(f(x), y). Every modern ML framework is, under the hood, a system that:

Lets you compose differentiable operations into a computational graph.
Runs forward to compute the output.
Runs backward to compute ∂L/∂θ for every parameter.
Updates θ ← θ − α ∂L/∂θ.
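The forward / backward / update cycle above can be sketched without any framework. This is a minimal illustration, not a reference implementation: the toy dataset (fit y = 3x), the single weight w, and the learning rate alpha = 0.01 are all assumptions chosen for the example, and the gradient is derived by hand rather than by autograd.

```python
# Hypothetical toy problem: fit y = 3x with a single weight w.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss_and_grad(w):
    """Mean squared error L(w) and its derivative dL/dw, derived by hand."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n       # forward
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n  # backward
    return loss, grad

w = 0.0        # initial parameter (theta)
alpha = 0.01   # learning rate, an illustrative choice
for step in range(200):
    loss, grad = loss_and_grad(w)
    w -= alpha * grad   # theta <- theta - alpha * dL/dtheta

final_loss, _ = loss_and_grad(w)
print(w, final_loss)
```

Everything an autograd engine adds on top of this is bookkeeping: it produces the `grad` line automatically for any composition of operations.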

What you can only learn by building one:

Why automatic differentiation is easy in principle (chain rule applied locally).
Why batch matrix multiply is the dominant cost of neural-network training.
Why activations like ReLU exist (gradient flow), and why deep nets needed innovations like residual connections.
Why the same gradient code that works for one neuron also works for GPT-class models -- autograd scales linearly with the number of operations.

2. Where this fits in the degree

Phase: Foundations
Semester: 1 (Linear algebra, statistics) -> 2 (Algorithms, DP)
Modules deepened: Sem 1 Module 4 (linear algebra foundations), Sem 1 Module 5 (statistics & inference), Sem 2 Module 4 (DP -- backprop is literally dynamic programming over a computational graph)

Cross-phase relevance:

Useful background for any modern data engineering, search ranking, or recommendation work
Connects directly to the "AI Model" category in BYO-X (Karpathy's "Build GPT from Scratch")

3. Prerequisites

Linear algebra: matrix multiplication, transpose, partial derivatives. Comfortable with dz/dx notation.
Python: NumPy. You should be able to write a matrix multiply with @.
Calculus: chain rule. That's it. No measure theory required.

If linear algebra feels shaky, finish Sem 1 Module 4 first.

4. Theory & research

Required reading

Andrej Karpathy, "Neural Networks: Zero to Hero" (karpathy.ai/zero-to-hero.html) -- video series. The single best resource for this project. Start with the micrograd video, then makemore.
Michael Nielsen, Neural Networks and Deep Learning (free online) -- Chapters 1-3 cover backprop and gradient descent at the right depth.

Strongly recommended

Goodfellow, Bengio, Courville, Deep Learning -- Chapter 6 (feedforward networks), Chapter 8 (optimization). The free PDF is at deeplearningbook.org.
3Blue1Brown's neural network series (youtube.com/@3blue1brown) -- for visual intuition.

For the autograd part specifically

Karpathy's micrograd (github.com/karpathy/micrograd) -- 150 lines. Read every line. This is your scaffold.

Historical / foundational papers

Rumelhart, Hinton, Williams (1986), "Learning representations by back-propagating errors" -- Nature paper that brought backprop into modern ML.
LeCun et al. (1998), "Gradient-Based Learning Applied to Document Recognition" -- the LeNet-5 paper that established CNNs for document recognition.

5. Curated tutorial list (from BYO-X)

C#: Neural Network OCR -- Andrew Kirillov
F#: Building Neural Networks in F#: Part 1 and Part 2 -- Mathias Brandewinder
Go: Build a multilayer perceptron with Golang, How to build a simple artificial neural network with Go, Building a Neural Net from Scratch in Go -- Daniel Whitenack and others
JavaScript / Java: Neural Networks - The Nature of Code [video] -- Daniel Shiffman
JavaScript: Neural networks from scratch for JavaScript linguists (Part1 -- The Perceptron)
Python: A Neural Network in 11 lines of Python -- iamtrask
Python: Implement a Neural Network from Scratch -- Denny Britz
Python: Optical Character Recognition (OCR)
Python: Traffic signs classification with a convolutional network
Python: Generate Music using LSTM Neural Network in Keras
Python: An Introduction to Convolutional Neural Networks
Python: Neural Networks: Zero to Hero -- Andrej Karpathy (recommended primary)
Python: SlowTorch: Implementation of PyTorch from the ground up in 100% pure Python -- aliziaei/SlowTorch

6. Recommended primary path

Karpathy's "Neural Networks: Zero to Hero", Lecture 1 (micrograd) -> Lecture 2 (makemore: bigrams) -> Lecture 3 (MLP).

This is, by a wide margin, the best path. Why:

Lecture 1 builds an autograd engine in 150 lines and trains a tiny MLP on it. You internalize backprop before you ever touch a tensor library.
Lecture 2 builds a bigram language model, first by counting and then as a single-layer network trained by gradient descent, so you also leave with a sense of what "language modeling" means.
Lecture 3 generalizes to a proper MLP with cross-entropy loss.

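After these three lectures you will have implemented and understood, on a smaller scale, the core operations GPT is built on: matrix multiplies, nonlinearities, softmax with cross-entropy, and backpropagation. Attention and normalization layers come in the later lectures.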

If you want to keep going: lectures on backprop, batch norm, WaveNet, and finally GPT from scratch in lecture 7.

7. Implementation milestones

Milestone 1: Scalar autograd ("micrograd")

Implement a Value class that wraps a Python float. Track operations as a DAG. Implement backward() via topological sort.

import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo = []
        visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

Evidence: Reproduce one of Karpathy's gradient verifications by computing dL/dw numerically and analytically -- they must agree to 6 decimal places.
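A gradient verification of this kind can be sketched on plain floats before wiring it to the Value class. The single tanh neuron, the squared-error loss against a target of 1.0, and the input values below are assumptions chosen for illustration; they are not Karpathy's exact example.

```python
import math

def forward(w, x, b):
    """One tanh neuron followed by a squared-error loss against target 1.0."""
    out = math.tanh(w * x + b)
    return (out - 1.0) ** 2

def analytic_dL_dw(w, x, b):
    """Chain rule by hand: dL/dout * dout/dz * dz/dw."""
    t = math.tanh(w * x + b)
    return 2.0 * (t - 1.0) * (1.0 - t * t) * x

def numeric_dL_dw(w, x, b, eps=1e-6):
    """Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)."""
    return (forward(w + eps, x, b) - forward(w - eps, x, b)) / (2.0 * eps)

w, x, b = 0.5, -1.5, 0.25   # arbitrary illustrative values
ana = analytic_dL_dw(w, x, b)
num = numeric_dL_dw(w, x, b)
print(f"analytic={ana:.8f} numeric={num:.8f}")
assert abs(ana - num) < 1e-6
```

To check the engine itself, replace `analytic_dL_dw` with the `.grad` your Value graph reports for w after calling backward() on the loss.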

Milestone 2: Tiny neural network on scalars

Build Neuron, Layer, MLP classes that compose Value objects. Train on a 4-point binary classification problem (Karpathy's exact example).

import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
    def __call__(self, x):
        act = sum((wi*xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [Layer(sizes[i], sizes[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers: x = layer(x)
        return x

Evidence: Loss curve over 200 epochs, dropping from ~5 to under 0.01.

Milestone 3: Vectorize with NumPy

Scalar autograd is slow. Rewrite Value to operate on NumPy arrays. Now your gradients are matrix gradients.

This is the conceptual leap: the same chain rule applies. The shapes change.

import numpy as np

class Tensor:
    def __init__(self, data, _children=(), _op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        ...

    def __matmul__(self, other):
        out = Tensor(self.data @ other.data, (self, other), '@')
        def _backward():
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward = _backward
        return out

Evidence: Train a 2-layer MLP on a subset of MNIST (e.g., 1000 examples). Reach >85% accuracy in under a minute.
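Before training anything, it is worth confirming the two matmul gradient formulas from the Tensor sketch against finite differences. The snippet below is one way to do that on raw NumPy arrays. The shapes, the random upstream gradient G, and the helper `numeric_grad` are assumptions for illustration. It uses float64 rather than float32 so the comparison is not dominated by rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 2))
G = rng.standard_normal((4, 2))   # stand-in for the upstream gradient dL/dC

def loss(A, B):
    # Scalar loss constructed so that dL/dC is exactly G, where C = A @ B.
    return np.sum((A @ B) * G)

# The same formulas as in Tensor.__matmul__'s _backward.
dA = G @ B.T
dB = A.T @ G

def numeric_grad(f, X, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to each entry of X."""
    g = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        old = X[idx]
        X[idx] = old + eps
        plus = f()
        X[idx] = old - eps
        minus = f()
        X[idx] = old          # restore the original entry
        g[idx] = (plus - minus) / (2.0 * eps)
    return g

num_dA = numeric_grad(lambda: loss(A, B), A)
num_dB = numeric_grad(lambda: loss(A, B), B)

assert dA.shape == A.shape and dB.shape == B.shape
assert np.allclose(dA, num_dA, atol=1e-6)
assert np.allclose(dB, num_dB, atol=1e-6)
```

The shape assertions alone catch the most common mistake: swapping the two transposes usually produces arrays that cannot even be added to `self.grad`.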

Milestone 4: Training loop, mini-batches, optimizer

Add: dataset loader, mini-batching, SGD with momentum, learning rate schedule, train/test split.

for epoch in range(epochs):
    for x_batch, y_batch in batches(X_train, y_train, batch_size):
        logits = model(x_batch)
        loss = cross_entropy(logits, y_batch)
        for p in model.parameters(): p.grad = np.zeros_like(p.data)
        loss.backward()
        for p in model.parameters():
            p.data -= lr * p.grad

Evidence: Train on full MNIST. Reach >95% test accuracy.
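The loop above calls a `batches` helper and uses a plain SGD update, while the milestone asks for momentum. A possible sketch of both pieces follows. The optional `rng` argument, the function name `sgd_momentum_step`, and the convention of passing parameters, gradients, and velocities as parallel lists of NumPy arrays are assumptions for illustration; adapt them to however your Tensor class stores `.data` and `.grad`.

```python
import numpy as np

def batches(X, y, batch_size, rng=None):
    """Yield shuffled (X_batch, y_batch) pairs that cover the dataset once."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

def sgd_momentum_step(params, grads, velocities, lr, momentum=0.9):
    """Classical momentum, in place: v <- momentum * v - lr * g, then p <- p + v."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v -= lr * g
        p += v

# Mini-batching on a toy dataset where each label equals its feature.
rng = np.random.default_rng(0)
X = np.arange(10, dtype=np.float64).reshape(10, 1)
y = np.arange(10)
sizes, aligned = [], True
for xb, yb in batches(X, y, batch_size=4, rng=rng):
    sizes.append(len(xb))
    aligned = aligned and bool(np.array_equal(xb[:, 0], yb))

# Momentum on the toy quadratic 0.5 * ||p||^2, whose gradient is p itself.
p = np.array([1.0, -2.0])
v = np.zeros_like(p)
for _ in range(300):
    sgd_momentum_step([p], [p.copy()], [v], lr=0.1)
print(sizes, aligned, p)
```

Shuffling X and y with the same index array is the detail that matters most here; shuffling them independently silently destroys the labels.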

Milestone 5: Convolution layer

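The hard one. Implement Conv2D forward and backward. Use im2col for the forward pass, which turns the convolution into one matrix multiply, and its transpose (col2im, which scatters patch gradients back into the input) for the backward pass. Verify gradients with finite differences.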

Evidence: Train a small CNN (Conv -> ReLU -> Pool -> FC) on MNIST. Reach >98%.
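As a starting point for the forward pass, here is a sketch of im2col under simplifying assumptions: stride 1, no padding, and NCHW layout. The function names and the naive reference loop are hypothetical helpers for illustration. Like most deep-learning code, it computes cross-correlation (no kernel flip). The backward pass is deliberately left out.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold x of shape (N, C, H, W) into patches of shape
    (N * out_h * out_w, C * kh * kw). Assumes stride 1 and no padding."""
    n, c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((n, out_h, out_w, c, kh, kw), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # All patches' (i, j) kernel position at once, moved to channels-last.
            cols[:, :, :, :, i, j] = x[:, :, i:i + out_h, j:j + out_w].transpose(0, 2, 3, 1)
    return cols.reshape(n * out_h * out_w, c * kh * kw), (out_h, out_w)

def conv2d_forward(x, weight, bias):
    """weight: (F, C, kh, kw), bias: (F,). Returns (N, F, out_h, out_w)."""
    n = x.shape[0]
    f, _, kh, kw = weight.shape
    cols, (out_h, out_w) = im2col(x, kh, kw)
    out = cols @ weight.reshape(f, -1).T + bias      # one big matmul
    return out.reshape(n, out_h, out_w, f).transpose(0, 3, 1, 2)

def conv2d_naive(x, weight, bias):
    """Slow reference implementation used only to check the fast path."""
    n, _, h, w = x.shape
    f, _, kh, kw = weight.shape
    out = np.zeros((n, f, h - kh + 1, w - kw + 1))
    for b in range(n):
        for k in range(f):
            for r in range(out.shape[2]):
                for s in range(out.shape[3]):
                    patch = x[b, :, r:r + kh, s:s + kw]
                    out[b, k, r, s] = np.sum(patch * weight[k]) + bias[k]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 6, 5))
weight = rng.standard_normal((4, 3, 3, 2))
bias = rng.standard_normal(4)
fast = conv2d_forward(x, weight, bias)
slow = conv2d_naive(x, weight, bias)
assert np.allclose(fast, slow)
```

Keeping the naive loop around as a test oracle is cheap and catches most indexing mistakes before you attempt the backward pass.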

Milestone 6 (optional, ambitious): Tiny transformer / GPT

This is the natural next project. See the dedicated LLM (GPT from scratch) tutorial -- it covers Karpathy's "Build GPT" lecture in full, with milestones from bigram -> attention -> multi-head -> transformer block -> Shakespeare-quality output. ~300 lines of Python + PyTorch.

Evidence: Generated samples after 5,000 training steps that look more like Shakespeare than random characters.

8. Tests & evidence

Test	How
Gradient check	For every operation, compute `dL/dx` analytically and via finite difference `(L(x+ε) − L(x−ε)) / (2ε)`. Must agree to ~1e-5.
Overfit a single batch	Loss should approach zero. If it doesn't, your gradients are wrong.
Compare against PyTorch	Run the same input through both, compare outputs and gradients.
Reproducibility	Fixed seed -> identical loss curve across runs.
Memory / runtime	Track wall time per epoch and peak memory. Required for the writeup.

9. Common pitfalls

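Forgetting to zero gradients before each backward pass. Gradients accumulate across calls to backward(), so a loop that never resets p.grad applies the sum of all past gradients at every step; the updates grow and the loss typically diverges, often to NaN.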
In-place ops on Tensors that participate in the graph. Always create new Tensors for results.
Wrong shapes in the backward of @ (matmul). The transpose rule: dA = dC @ B.T, dB = A.T @ dC. Easy to swap.
Vanishing/exploding gradients with tanh or sigmoid deep nets. Use ReLU and proper init (He/Glorot).
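Wrong ordering in the topological sort. A node's _backward must run only after every node that consumes its output has already run, so that its grad is fully accumulated before it is propagated further. With the build() helper from Milestone 1, that means iterating reversed(topo): the output first, then its inputs (_prev).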
Computing softmax + cross-entropy separately. Combine them into a single op for numerical stability (subtract max logit).
Treating learning rate as a constant. It is the single most important hyperparameter. Start with 1e-3 for Adam, 0.1 for SGD on small nets.
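For the softmax + cross-entropy pitfall above, a fused version might look like the following sketch. The function name, the convention of integer class ids as targets, and returning the gradient alongside the loss are assumptions for illustration. The key steps are subtracting the row-wise max logit before exponentiating and using the closed-form gradient (softmax minus one-hot, divided by the batch size).

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """logits: (N, K) floats; targets: (N,) integer class ids.
    Returns (mean loss, dL/dlogits)."""
    n = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)   # subtract max logit
    log_z = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = shifted - log_z                            # log-softmax
    loss = -log_probs[np.arange(n), targets].mean()

    grad = np.exp(log_probs)              # softmax probabilities
    grad[np.arange(n), targets] -= 1.0    # subtract the one-hot targets
    grad /= n                             # because the loss is a mean
    return loss, grad

# Logits around 1000 would overflow a naive exp(); the shifted version stays finite.
logits = np.array([[1000.0, 1001.0, 999.0],
                   [0.0, 0.0, 0.0]])
targets = np.array([1, 2])
loss, grad = softmax_cross_entropy(logits, targets)
print(loss)
print(grad)
```

Each row of `grad` sums to zero, which is a quick sanity check you can add to your unit tests.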

10. Extensions

Batch normalization -- Karpathy covers it; the gradient formula is fiddly but illuminating.
Adam optimizer -- keep running estimates of first and second moments of gradients.
Recurrent network (RNN, LSTM) -- backprop through time.
Transformer / self-attention -- the modern building block. See the dedicated LLM tutorial.
GPU support via CuPy -- drop-in NumPy replacement for Tensor.data.
Save/load model weights -- np.savez is enough.
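For the Adam extension, the update rule can be sketched as below. The class layout and the choice to pass gradients as a list of NumPy arrays are assumptions made for illustration; in your engine you would read them from each parameter's `.grad`. The hyperparameter defaults are the commonly used ones (lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8).

```python
import numpy as np

class Adam:
    """Minimal Adam sketch operating in place on a list of NumPy arrays."""

    def __init__(self, params, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]   # first-moment estimates
        self.v = [np.zeros_like(p) for p in params]   # second-moment estimates
        self.t = 0                                    # step counter

    def step(self, grads):
        self.t += 1
        for p, g, m, v in zip(self.params, grads, self.m, self.v):
            m *= self.beta1
            m += (1.0 - self.beta1) * g
            v *= self.beta2
            v += (1.0 - self.beta2) * g * g
            # Bias correction compensates for m and v starting at zero.
            m_hat = m / (1.0 - self.beta1 ** self.t)
            v_hat = v / (1.0 - self.beta2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

# One step on the toy objective 0.5 * ||p||^2, whose gradient is p itself.
p = np.array([1.0, -2.0])
opt = Adam([p])
opt.step([p.copy()])
print(p)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so every parameter moves by roughly `lr` regardless of gradient scale. That property is why Adam tolerates a smaller default learning rate than plain SGD.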

11. Module integration

Module	What the neural network deepens
Sem 1 Module 4 -- Linear algebra	Forward pass is dense matmul; backward is `A.T @ ∂L/∂Y`. Internalizes matrix shapes.
Sem 1 Module 5 -- Statistics	Cross-entropy = negative log-likelihood. Connects ML loss to statistical inference.
Sem 2 Module 4 -- Dynamic programming	Backprop is DP over a DAG. The `backward` function is a memoized recursion.
Sem 2 Module 5 -- Advanced structures	Topological sort of the computational graph. Hash set for visited nodes.
LLM tutorial	Direct continuation. Use this autograd engine to build a transformer.
3D Renderer tutorial	Sibling project: same Sem 1 math (linear algebra + probability), different domain.

12. Portfolio framing

What to publish:

The autograd engine (clearly separated from the training code).
A "from scratch -> PyTorch" comparison notebook: same model, same data, your code vs torch.nn.
One MNIST result with a confusion matrix.

What to keep private:

Datasets or pre-trained weights that you don't have a license to redistribute.
Failed gradient-check sessions (keep them in notes -- they show learning).

Reviewer entry points:

README: start at engine.py, then nn.py, then train_mnist.py.
Call out the gradient-check test as the primary correctness evidence.

13. Local source backbone

Use Build Your Own Neural Networks: Step-by-Step (build-your-own/neural-networks-kilho-shin) as the slower, beginner-friendly companion to the Karpathy path. Do not treat it as a replacement for building the autograd engine; use it to make every math and training decision explicit.

Local chunks	Use them for	Curriculum insertion
`002`-`004`	Neural-network vocabulary, applications, limitations, and model families	Before Milestone 1, write a one-page "what this model can and cannot learn" note.
`006`-`010`	Python, NumPy arrays, broadcasting, and vectorized operations	Add a NumPy warmup lab before implementing `Value` and tensor operations.
`011`-`014`	Activation functions, weights, forward pass, backpropagation	Expand the gradient-check milestone with one hand-worked forward/backward example.
`015`-`018`	Training loop, validation split, debugging, and bias/variance	Add training diagnostics: loss curves, overfit-one-batch evidence, and error analysis.
`019`-`021`	Regularization, dropout, transfer learning, hyperparameters	Add a tuning checkpoint after MNIST: compare baseline, L2/dropout, and learning-rate schedules.
`022`-`028`	CNNs, pooling, LSTMs, preprocessing, augmentation	Use as optional extensions after the core MLP/autograd project is stable.
`029`-`032`	Deployment, research habits, bias, privacy, AI ethics	Add a final model card and data-risk note to the portfolio artifact.

Extra checkpoints from the book chunks

Vectorization checkpoint: rewrite a scalar forward pass as batched NumPy, then prove the output is unchanged.
Backprop checkpoint: compute one two-layer network gradient by hand, by finite difference, and by your engine.
Generalization checkpoint: train the same model under underfit, fit, and overfit settings; explain the curves.
Responsible-use checkpoint: document the dataset source, label risks, privacy assumptions, and known bias.
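The vectorization checkpoint in the list above can be demonstrated on a single tanh layer. The layer sizes, the random weights, and the function names are assumptions for illustration. The point is the final assertion: the per-neuron Python loop and the batched matmul must produce the same numbers.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, size=(3, 4))   # 3 inputs -> 4 neurons
b = rng.uniform(-1, 1, size=4)
X = rng.uniform(-1, 1, size=(5, 3))   # batch of 5 examples

def layer_scalar(x):
    """One example, one neuron at a time, using plain Python floats."""
    outs = []
    for j in range(W.shape[1]):
        act = float(b[j])
        for i in range(W.shape[0]):
            act += float(W[i, j]) * float(x[i])
        outs.append(math.tanh(act))
    return outs

def layer_batched(X):
    """The same layer for the whole batch as one matmul plus a broadcast add."""
    return np.tanh(X @ W + b)

scalar_out = np.array([layer_scalar(x) for x in X])
batched_out = layer_batched(X)
assert scalar_out.shape == batched_out.shape == (5, 4)
assert np.allclose(scalar_out, batched_out, atol=1e-12)
```

The same pattern extends to the backward pass: compare the scalar engine's gradients with the tensor engine's on identical weights and inputs.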

14. Deep project spec

Project contract

Build a neural-network implementation that makes the math inspectable. The minimum contract is scalar autograd, vectorized tensor operations where appropriate, a small MLP, gradient checking, a training loop, validation metrics, and one real dataset experiment. CNNs, RNNs, dropout, and deployment are extensions.

Source-backed reading map

Source ID	Use for	Required output
`build-your-own/neural-networks-kilho-shin`	model vocabulary, NumPy/vectorization, activations, backprop, training diagnostics, regularization	math notes, gradient checks, training report

Milestone map

Milestone	Deliverable	Tests	Failure case
Scalar autograd	`Value` graph and backward pass	derivative fixtures	wrong topological order
Tensor/vector path	NumPy batched operations	scalar vs vector equivalence	broadcasting bug
MLP	layers, activations, loss	overfit tiny dataset	dead activation or bad init
Gradient check	finite-difference comparison	per-parameter tolerance tests	gradient sign error
Training loop	optimizer, train/val split, metrics	loss decreases on controlled task	train/val leakage
MNIST or equivalent	real dataset experiment	accuracy/loss report	overfit/underfit diagnosis
Regularization extension	L2/dropout/augmentation	baseline comparison	regularization worsens result without explanation

Test matrix

Test type	Required examples
Unit	activations, loss functions, tensor shapes
Numerical	finite-difference gradient checks
Golden	one hand-worked forward/backward example
Experiment	fixed-seed training run with saved config
Failure analysis	underfit, fit, and overfit comparison

Design notes required

math.md: forward equations, backward equations, tensor shapes.
training.md: optimizer, initialization, batch size, learning rate, stopping rule.
experiment.md: dataset, split, metrics, seed, hardware, and interpretation.
model-card.md: intended use, limitations, data risks, and bias/privacy notes.

Portfolio evidence

Publish gradient-check output, a loss curve, one confusion/error analysis table, fixed-seed config, and a short explanation of the model's limits.

Source

This tutorial draws from the BYO-X catalog entries for "Neural Network" and "AI Model". Karpathy's "Neural Networks: Zero to Hero" is the strongly recommended primary path.
