1 Introduction

On the next-to-last page of my first book, I wrote that I dreamed of a future where we could take our code and compile it for whatever backend we desire, via MLIR. I meant it as a wishlist item — something I expected to happen eventually, but not soon enough to matter for that work.

It turned out “eventually” was about five years.

This is the book I wanted to write back then, and couldn’t. The tools weren’t ready. Lean 4 was still in development. MLIR was a research project. IREE couldn’t compile a training step. Now all three are production-quality, and the thing I wished for is not only possible — it’s sitting in a GitHub repo, training ResNets on a single GPU, with machine-checked proofs that every gradient is correct.

What this book is about

This book teaches you how backpropagation actually works.

Not the hand-wave version — “compute gradients, update weights, repeat.” You already know that version. If you’ve trained a model, you’ve called .backward() a thousand times. What you probably haven’t done is open the hood and look at what .backward() is actually computing, layer by layer, and verify that the computation is mathematically correct.

That’s what we’re going to do. For every layer in the modern stack — dense, ReLU, convolution, pooling, batch normalization, residual connections, depthwise convolution, squeeze-and-excitation, layer normalization, and self-attention — we will:

Derive the backward pass from the forward pass, using nothing but the chain rule and some index algebra.
State the result formally in Lean 4, a programming language that doubles as a theorem prover.
Show the MLIR that compiles to GPU bytecode, and verify it matches the math.
Train the model on real data, end-to-end, on your hardware.

Three views of the same computation: the math, the proof, the code that runs. They agree, and the entire book is about understanding why.

The thesis

Backpropagation is one operation (the vector-Jacobian product), composed using three rules:

The chain rule — if you compose two functions, you compose their backward passes.
Additive fan-in — if two paths feed into a sum (like a residual connection), their gradients add.
Multiplicative fan-in — if two paths feed into a product (like squeeze-and-excitation or attention), the product rule gives you two backward contributions that add.

And five structural tricks for layers whose Jacobians are dense but exploitably structured:

Diagonal — elementwise activations (ReLU, GELU, Swish). The Jacobian is diagonal; the backward is one multiply.
Sparse Toeplitz — convolutions. The Jacobian is block-sparse with a sliding-window structure; the backward is another convolution with a reversed kernel.
Binary selection — max pooling. The Jacobian routes the gradient to whichever input was the max.
Rank-1 correction to diagonal — softmax, batch norm, layer norm. The Jacobian is dense but has a clean closed form because it’s a diagonal plus a rank-1 outer product.
Outer product reductions — dense layers and matmul. The weight gradient is the outer product of the input with the output gradient.

That’s it. Three rules, five tricks. Every architecture in this book — MLP, CNN, ResNet, MobileNet, EfficientNet, Vision Transformer — and every architecture in the bestiary — U-Net, DETR, Mamba, GAN, VAE, diffusion models, CLIP, GPT-style decoders — decomposes into some combination of these eight things. There is no ninth.

I didn’t set out to prove this. I set out to train a ResNet with verified gradients, and by the time I’d done it for enough architectures, the pattern became impossible to ignore. The framework kept working without needing new primitives.

Who this book is for

This book assumes you’ve trained a neural network. Not a fancy one. If you’ve fine-tuned a pretrained model, debugged a NaN loss, or watched a model overfit and wondered why, you’re the audience.

I’m going to assume you know what a derivative is, what a matrix is, and roughly what a convolutional layer does. I’m not going to assume you know Lean, MLIR, formal verification, or compiler theory. The Lean code appears throughout the book as a precise specification and a credibility instrument — not as a barrier. If you skip every Lean snippet, you still get a complete book on the mechanics of backpropagation. The Lean is there for the reader who wants to check me, not the reader who needs to learn formal methods.

This is not a beginner’s book. If you’re looking for your first introduction to neural networks, there are several good ones: my first book is one option, fast.ai’s free courses are another, and there are good resources all over the web. Come back to this one when you’ve trained something and want to understand what you did.

This is also not a research textbook. I’m not trying to write Goodfellow or Bishop. The proofs here are for credibility, not for theoretical depth. If you want the full formal story, Mathlib is open-source and you can extend it. What this book gives you that a research textbook doesn’t is the bridge from the math to the actual GPU code, and the assurance that they agree.

If you’re me from ten years ago — interested in ML, willing to put in the work, not sure where the math stops and the folklore begins — this is the book I wish I’d had.

Why now

Three things had to happen before this book was possible, and all three happened in the last few years:

Lean 4 reached maturity. Lean is a programming language that is also a theorem prover. You can write a function in Lean and then write a proof that the function does what you claim. If the proof compiles, the claim is correct — not because you trust the author, but because the compiler checked it. Lean 4 (released 2023) is the version that made this practical for non-specialists: it has a real package manager, real tooling, a growing math library (Mathlib), and an active community. What unblocked this book in particular was Mathlib’s Fréchet-derivative API (fderiv) — first-order calculus contributed by Avigad in 2019, with the higher-order $C^n$ scaffolding around it stabilized by Gouëzel over 2019–2025. Without that API — chain rule, product rule, continuous-linear-map composition — the central claim of this book, that the backward pass equals the true Jacobian, would not be a theorem; it would be a wish.

MLIR and StableHLO stabilized. MLIR is a compiler infrastructure for machine learning. StableHLO is a portable operation set built on MLIR that describes the computations neural networks perform. Together, they let you specify a training step as a sequence of mathematical operations and compile it to any hardware backend — CPU, CUDA, ROCm — without writing backend-specific code. This is the layer that turns “math on paper” into “code on the GPU.”

IREE became production-ready. IREE (Intermediate Representation Execution Environment) is the runtime that takes a compiled StableHLO module and executes it on actual hardware. It handles memory management, device allocation, and kernel dispatch. It’s what lets us go from a Lean spec file to a running training loop with no Python anywhere in the pipeline.

The combination of these three tools means, for the first time, that a single person can: (a) write a neural network spec in a language with a type system strong enough to state and check mathematical properties; (b) compile that spec to GPU-runnable code via a standard intermediate representation; and (c) train the network end-to-end on consumer hardware, with formal guarantees about the gradient computation. That’s the pipeline this book is built around.

Why image recognition

Same reason as the first book: it’s the oldest, most well-understood corner of deep learning, which means we can introduce primitives one at a time in a logical order.

We start with MNIST (digits, $28 \times 28$ grayscale) and work up through CIFAR-10 (color, $32 \times 32$) to Imagenette (real photos, $224 \times 224$). The architectures form a natural progression: MLP $\to $ CNN $\to $ ResNet $\to $ MobileNet $\to $ EfficientNet $\to $ ViT. Each one adds exactly one new structural primitive to the framework. By the time you reach ViT, you’ve seen every primitive there is to see for modern vision — and as it turns out, for modern language models too, since transformers are the same architecture in both domains.

The bestiary at the end demonstrates that the same primitives cover detection (DETR), segmentation (U-Net), generation (diffusion models), self-supervised learning (MAE, CLIP), sequence modeling (Mamba), and multi-modal models (LLaVA). Image recognition is the on-ramp; the destination is the whole modern stack.

Why Lean

I considered several options for the “formal” half of the book — Rocq, Agda, Isabelle, or just doing the proofs on paper with LaTeX. I chose Lean for a practical reason: it reads like code.

If you’ve trained a model in PyTorch, you can read:

def dense {m n : Nat} (W : Mat m n) (b : Vec n) (x : Vec m) : Vec n :=
  fun j => finSum m (fun i => x i * W i j) + b j

and understand it. It’s a function with typed arguments. {m n : Nat} are the dimensions. fun j => is a lambda. finSum is a for loop. You don’t need to learn a new paradigm — you need to learn maybe twenty keywords and some notation, and then Lean reads like any other typed functional language.

The other reason is that Lean enforces honesty. If I claim that the backward pass of batch normalization is a specific three-term formula, the Lean compiler requires me to prove it. If the proof compiles, the claim is correct — not because you trust me, but because the type checker verified it. There is no hidden hand-waving. Every VJP in this book — dense, convolution, batch norm, residual, attention, all of them — builds with zero sorrys. The compiler is the referee.

Let’s go

In 2018, I was on a small team that won Stanford’s DAWNBench competition — training ImageNet in 3 hours for $25 on commodity hardware. The lesson from that experience was simple: most of the apparent difficulty in modern deep learning isn’t fundamental. It’s accumulated folklore, missing tools, and a culture that takes the math on faith.

This book applies that same lesson to the math itself. The folklore says backpropagation is complicated. It isn’t. It’s three rules, five tricks, and a lot of bookkeeping. The tools to verify this claim finally exist, and you’re holding the result.

Three of my favorite citations of the first book are from high school students. I don’t think it’s an accident teenagers are publishing results. The barriers to doing real ML research used to be infrastructure — compute, data, institutional access. Those barriers have mostly fallen. What remains is the smaller barrier of understanding what you’re doing at the level where you can verify it, not just run it.

This book is for the person who wants to cross that barrier. It might be you. It might be the next kid who picks it up in a library somewhere and decides to see how far they can go.

Let’s find out.