1 MNIST: linear classifier

A VJP (vector-Jacobian product) is the backward function for a layer: it takes an upstream gradient and returns the downstream one. Every backward function in this book is a VJP built by composing the foundation laid here: the partial-derivative function $\operatorname {pdiv}$ (defined via Mathlib’s $\operatorname {fderiv}$), its three structural rules (chain, sum, product), and three VJP record types ($\mathsf{HasVJP}$, $\mathsf{HasVJPMat}$, $\mathsf{HasVJP3}$, one per tensor rank) that bundle a backward function with its correctness claim.

This is the technically hardest chapter in the book, by design. Going from nothing to a complete trained model — the $\operatorname {pdiv}$ calculus, the VJP framework, a forward pass, a loss, a backward pass, an SGD step, all the way down to the GPU — is the entire machinery, built in a single chapter. We do it on the smallest possible network (one matrix multiply: the MNIST linear classifier) to demonstrate the pattern. Once that machinery exists, every later chapter is just adding a layer: one new primitive dropped into the same forward / loss / backward / optimize loop.

The eleven theorems that follow all reduce to a single definition ($\operatorname {pdiv}$) plus the chain rule, sum rule, and product rule. Before we drop you into the deep end, here’s the shape of what’s coming:

$\begin{tikzpicture} [ node distance=1.2cm, every node/.style={font=\small, align=center}, group/.style={rectangle, draw, fill=blue!5, rounded corners, inner sep=4pt, minimum width=5cm, minimum height=1.5cm}, root/.style={rectangle, draw, fill=red!10, rounded corners, inner sep=4pt, minimum width=4cm, minimum height=0.8cm}, payoff/.style={rectangle, draw, fill=green!10, rounded corners, inner sep=4pt, minimum width=6.5cm, minimum height=1cm}, arrow/.style={->, >=stealth, thick, gray} ] \node[root] (pdiv) at (0, 5) {$\pdiv$ \\ \footnotesize defined via Mathlib's $\fderiv$}; \node[group] (foundation) at (0, 3) {\textbf{Foundation rules} \\ chain $\cdot$ sum $\cdot$ product \\ identity $\cdot$ const $\cdot$ reindex \\ finite-sum}; \node[group] (vjp) at (0, 0.5) {\textbf{VJP transpose} \\ \texttt{vjp\_comp} $\cdot$ biPath \\ elemwiseProduct $\cdot$ identity}; \node[payoff] (payoff) at (0, -2) {Used by the MNIST linear classifier here \\ (proven + trained); then reused by every \\ architecture in Chapters 2--9.}; \draw[arrow] (pdiv) -- (foundation); \draw[arrow] (foundation) -- (vjp); \draw[arrow] (vjp) -- (payoff); \end{tikzpicture}$

Read top-to-bottom, this is the order things get proved. Read bottom-to-top, this is what every theorem in Chapters 2–9 unfolds to. (Ch 9 attention adds matrix-level machinery on top; see §9.1.) The full clickable dependency graph is in the blueprint web view.

How the proofs are written.

The proofs in this book follow Lamport’s structured style (How to Write a 21st Century Proof). Hypotheses are named up front in an assume:/prove: header, with the matching Lean hypothesis name in brackets. The proof is a numbered sequence of steps; each step carries its own proof naming exactly the facts it follows from; the final step is always q.e.d., the explanation of why the steps prove the goal. In equational chains, $\mathord {@}$ stands for the expression established by the previous step. Each step mirrors one tactic block of the corresponding Lean proof, so the informal argument can be audited against the formal one step by step. One-line lemmas (pdiv_id, pdiv_const, pdiv_reindex, the identity VJP) stay unstructured: structuring a one-simp proof would be ceremony.

1.1 The theorems

Definition 1 Partial derivative

✓

The partial derivative function. For $f : \mathbb {R}^{m} \to \mathbb {R}^{n}$, $\operatorname {pdiv}\, f\, x\, i\, j$ is the partial derivative $\partial f_j / \partial x_i$ at $x$: the Jacobian entry with input index $i$ and output index $j$ (input index first, so the transpose of the usual row-per-output layout). It is defined as $\operatorname {fderiv}_{\mathbb {R}}\, f\, x\, (\mathbf{e}_i)\, j$, the $j$-th coordinate of Mathlib’s Fréchet derivative applied to the $i$-th standard basis vector.
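A minimal Lean sketch of how Definition 1 might be stated, assuming a recent Mathlib and modelling $\mathbb {R}^{m}$ as Fin m → ℝ. The name pdivSketch is hypothetical, and the book's actual definition may represent vectors differently:

```lean
import Mathlib

-- Hypothetical sketch of Definition 1 (not the book's source).
-- `Pi.single i 1` is the i-th standard basis vector e_i.
noncomputable def pdivSketch {m n : ℕ}
    (f : (Fin m → ℝ) → (Fin n → ℝ)) (x : Fin m → ℝ)
    (i : Fin m) (j : Fin n) : ℝ :=
  fderiv ℝ f x (Pi.single i (1 : ℝ)) j
```

The argument order matters downstream: the input index $i$ comes first and the output index $j$ second, which is what lets the VJP sum in Definition 9 run over $j$.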

Theorem 2 Chain rule

✓

assume:

1. $f : \mathbb {R}^{m} \to \mathbb {R}^{n}$ is differentiable at $x$ [hf]
2. $g : \mathbb {R}^{n} \to \mathbb {R}^{p}$ is differentiable at $f(x)$ [hg]

prove: for all $i, k$:

\[ \operatorname {pdiv}(g \circ f)\, x\, i\, k = \sum _j \operatorname {pdiv}f\, x\, i\, j \cdot \operatorname {pdiv}g\, (f\, x)\, j\, k. \]

Proof.

Sketch: reduce to Mathlib’s chain rule for $\operatorname {fderiv}$, then decompose over the standard basis to turn a composite of linear maps into the sum over the middle index.

1. $\operatorname {pdiv}(g \circ f)\, x\, i\, k = \operatorname {fderiv}_{\mathbb {R}}\, (g \circ f)\, x\, (\mathbf{e}_i)\, k$.
proof: Definition 1.
2. $\operatorname {fderiv}_{\mathbb {R}}\, (g \circ f)\, x = \operatorname {fderiv}_{\mathbb {R}}\, g\, (f\, x) \circ \operatorname {fderiv}_{\mathbb {R}}\, f\, x$.
proof: Mathlib’s fderiv_comp, applicable by assumptions 1 and 2.
3. Let $v := \operatorname {fderiv}_{\mathbb {R}}\, f\, x\, (\mathbf{e}_i) \in \mathbb {R}^{n}$. Then $v = \sum _j v_j \cdot \mathbf{e}_j$.
proof: Pointwise: at coordinate $j'$ the sum collapses to $v_{j'}$, because $(\mathbf{e}_j)_{j'}$ is the Kronecker delta.
4. $\operatorname {fderiv}_{\mathbb {R}}\, g\, (f\, x)\, v\, k = \sum _j v_j \cdot \operatorname {fderiv}_{\mathbb {R}}\, g\, (f\, x)\, (\mathbf{e}_j)\, k$.
proof: By 3 and linearity of $\operatorname {fderiv}_{\mathbb {R}}\, g\, (f\, x)$ (map_sum, map_smul).
5. q.e.d.
proof: By Definition 1, $v_j = \operatorname {pdiv}f\, x\, i\, j$ and $\operatorname {fderiv}_{\mathbb {R}}\, g\, (f\, x)\, (\mathbf{e}_j)\, k = \operatorname {pdiv}g\, (f\, x)\, j\, k$; chaining 1, 2, and 4 gives the goal.
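A quick sanity check of the index convention on two linear maps (an illustration only, not part of the formal development). Take $f(x) = A x$ with $A \in \mathbb {R}^{n \times m}$ and $g(y) = C y$ with $C \in \mathbb {R}^{p \times n}$. A linear map is its own derivative, so $\operatorname {pdiv}f\, x\, i\, j = A_{ji}$ and $\operatorname {pdiv}g\, y\, j\, k = C_{kj}$, and Theorem 2 gives

\[ \operatorname {pdiv}(g \circ f)\, x\, i\, k = \sum _j A_{ji}\, C_{kj} = (C A)_{ki}, \]

which is the input-first entry of the matrix $C A$ of the composite $x \mapsto C A x$, as expected.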

Theorem 3 Sum rule

✓

assume:

$f, g : \mathbb {R}^{m} \to \mathbb {R}^{n}$ are both differentiable at $x$ [hf, hg]

prove: for all $i, j$:

\[ \operatorname {pdiv}(f + g)\, x\, i\, j = \operatorname {pdiv}f\, x\, i\, j + \operatorname {pdiv}g\, x\, i\, j. \]

Proof.

1. $\operatorname {pdiv}(f + g)\, x\, i\, j = \operatorname {fderiv}_{\mathbb {R}}\, (f + g)\, x\, (\mathbf{e}_i)\, j$.
proof: Definition 1; the pointwise sum $\lambda y\, k.\; f\, y\, k + g\, y\, k$ is definitionally the function $f + g$.
2. $\operatorname {fderiv}_{\mathbb {R}}\, (f + g)\, x = \operatorname {fderiv}_{\mathbb {R}}\, f\, x + \operatorname {fderiv}_{\mathbb {R}}\, g\, x$.
proof: Mathlib’s fderiv_add, applicable by the assumption.
3. q.e.d.
proof: Evaluate 2 at $(\mathbf{e}_i, j)$; by Definition 1 the two terms are $\operatorname {pdiv}f\, x\, i\, j$ and $\operatorname {pdiv}g\, x\, i\, j$.

Theorem 4 Product rule

✓

assume:

$f, g : \mathbb {R}^{m} \to \mathbb {R}^{n}$ are both differentiable at $x$ [hf, hg]

prove: for all $i, j$:

\[ \operatorname {pdiv}(f \odot g)\, x\, i\, j = \operatorname {pdiv}f\, x\, i\, j \cdot g\, x\, j + f\, x\, j \cdot \operatorname {pdiv}g\, x\, i\, j, \]

where $(f \odot g)\, y\, k = f\, y\, k \cdot g\, y\, k$ is the elementwise product.

Proof.

1. $\operatorname {pdiv}(f \odot g)\, x\, i\, j = \operatorname {fderiv}_{\mathbb {R}}\, (f \cdot g)\, x\, (\mathbf{e}_i)\, j$.
proof: Definition 1; the elementwise product is definitionally the algebra product $f \cdot g$ in $\mathbb {R}^{n}$ (Pi.normedAlgebra), so Mathlib’s calculus of algebra-valued maps applies.
2. $\operatorname {fderiv}_{\mathbb {R}}\, (f \cdot g)\, x = f\, x \cdot \operatorname {fderiv}_{\mathbb {R}}\, g\, x + g\, x \cdot \operatorname {fderiv}_{\mathbb {R}}\, f\, x$ (scalar action taken pointwise).
proof: Mathlib’s fderiv_mul, applicable by the assumption.
3. q.e.d.
proof: Evaluate 2 at $(\mathbf{e}_i, j)$: the pointwise scalar action gives $f\, x\, j \cdot \operatorname {pdiv}g\, x\, i\, j + g\, x\, j \cdot \operatorname {pdiv}f\, x\, i\, j$ by Definition 1; commute the second summand (ring) to match the goal.
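As an illustrative check (using the identity Jacobian, Theorem 5), take $f = g = \mathrm{id}$ on $\mathbb {R}^{n}$, so that $(f \odot g)\, x\, j = x_j^2$. Theorem 4 then gives

\[ \operatorname {pdiv}(\mathrm{id} \odot \mathrm{id})\, x\, i\, j = \delta _{ij} \cdot x_j + x_j \cdot \delta _{ij} = 2\, x_j\, \delta _{ij}, \]

which matches $\partial (x_j^2) / \partial x_i$ computed by hand: $2 x_j$ on the diagonal and $0$ elsewhere.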

Theorem 5 Identity Jacobian

✓

$\operatorname {pdiv}(\mathrm{id})\, x\, i\, j = \delta _{ij}$.

Proof.

Mechanical; see Proofs.pdiv_id.

Theorem 6 Constant has zero Jacobian

✓

Proof.

Mechanical; see Proofs.pdiv_const.

Theorem 7 Gather / reindex Jacobian

✓

Covers permutations, reshapes, slicing. Generalizes pdiv_id.

Proof.

Mechanical; see Proofs.pdiv_reindex.

Theorem 8 Finite-sum rule

✓

assume:

1. $S$ is a finite index set with a function $f_s : \mathbb {R}^{m} \to \mathbb {R}^{n}$ for each $s \in S$
2. every $f_s$ is differentiable at $x$ [hdiff]

prove: for all $i, j$:

\[ \operatorname {pdiv}\Bigl(\sum _{s \in S} f_s\Bigr)\, x\, i\, j = \sum _{s \in S} \operatorname {pdiv}f_s\, x\, i\, j. \]

Proof.

Sketch: induction on $S$; the two-summand sum rule does each inductive step.

1. case $S = \emptyset $: both sides are $0$.
proof: The empty sum is the constant zero function, whose Jacobian vanishes by Theorem 6; the right side is an empty sum.
2. case $S = \{ a\} \cup T$ with $a \notin T$, assuming the claim holds for $T$ (induction hypothesis): $\operatorname {pdiv}\bigl(\sum _{s \in S} f_s\bigr)\, x\, i\, j = \operatorname {pdiv}f_a\, x\, i\, j + \sum _{s \in T} \operatorname {pdiv}f_s\, x\, i\, j$.
proof: Split the sum as $f_a + \sum _{s \in T} f_s$ (Finset.sum_insert, using $a \notin T$). The tail $\sum _{s \in T} f_s$ is differentiable at $x$ as a finite sum of functions differentiable by assumption 2 (DifferentiableAt.fun_sum), so the sum rule (Theorem 3) splits the Jacobian; the induction hypothesis rewrites the tail.
3. q.e.d.
proof: By induction on the finite set $S$ (Finset.induction_on): 1 is the base case, and re-merging the sum in 2 gives the inductive step.

Definition 9 VJP record

✓

For $f : \mathbb {R}^{m} \to \mathbb {R}^{n}$, $\mathsf{HasVJP}\, f$ bundles a backward function $B : \mathbb {R}^{m} \to \mathbb {R}^{n} \to \mathbb {R}^{m}$ with its correctness claim: for all $x$, $dy$, $i$,

\[ B(x, dy)_i = \sum _j \operatorname {pdiv}f\, x\, i\, j \cdot dy_j. \]

Exhibiting a $\mathsf{HasVJP}\, f$ is exactly the statement “this backward function computes the vector–Jacobian product of $f$.”
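A rough Lean rendering of this record, as a sketch under stated assumptions: a recent Mathlib, vectors modelled as Fin m → ℝ, and the hypothetical names pdivSketch and HasVJPSketch. The book's own $\mathsf{HasVJP}$ may use different field names or vector types:

```lean
import Mathlib

-- Hypothetical stand-in for Definition 1, repeated so the block stands alone.
noncomputable def pdivSketch {m n : ℕ}
    (f : (Fin m → ℝ) → (Fin n → ℝ)) (x : Fin m → ℝ)
    (i : Fin m) (j : Fin n) : ℝ :=
  fderiv ℝ f x (Pi.single i (1 : ℝ)) j

-- A backward function bundled with the claim that it computes the VJP of f.
structure HasVJPSketch {m n : ℕ} (f : (Fin m → ℝ) → (Fin n → ℝ)) where
  backward : (Fin m → ℝ) → (Fin n → ℝ) → (Fin m → ℝ)
  correct  : ∀ x dy i, backward x dy i = ∑ j, pdivSketch f x i j * dy j
```

Bundling the proof with the function means a value of this type cannot be constructed from an incorrect backward pass, which is the point of the record.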

Theorem 10 VJP chain rule

✓

assume:

1. $B_f$ is a correct backward function for $f$ ($\mathsf{HasVJP}\, f$) [hf]
2. $B_g$ is a correct backward function for $g$ ($\mathsf{HasVJP}\, g$) [hg]
3. $f$ is differentiable everywhere [hf_diff]
4. $g$ is differentiable everywhere [hg_diff]

prove: $\mathsf{HasVJP}\, (g \circ f)$.

Proof.

Sketch: the composite backward is “run $B_g$, feed the result to $B_f$”; correctness is the two correct fields glued by the chain rule for $\operatorname {pdiv}$.

1. Define $B(x, dy) := B_f\bigl(x,\; B_g(f\, x,\; dy)\bigr)$.
2. suffices: for all $x$, $dy$, $i$: $B(x, dy)_i = \sum _k \operatorname {pdiv}(g \circ f)\, x\, i\, k \cdot dy_k$.
proof: Definition 9, with $B$ as the candidate backward function.
3. $B(x, dy)_i = \sum _j \operatorname {pdiv}f\, x\, i\, j \cdot B_g(f\, x, dy)_j$.
proof: Correctness of $B_f$ (assumption 1).
4. $\mathord {@}= \sum _j \operatorname {pdiv}f\, x\, i\, j \sum _k \operatorname {pdiv}g\, (f\, x)\, j\, k \cdot dy_k$.
proof: Correctness of $B_g$ (assumption 2).
5. $\mathord {@}= \sum _k \Bigl(\sum _j \operatorname {pdiv}f\, x\, i\, j \cdot \operatorname {pdiv}g\, (f\, x)\, j\, k\Bigr)\, dy_k$.
proof: Distribute and swap the two sums (Finset.mul_sum, Finset.sum_comm); both are finite, so no convergence question arises.
6. q.e.d.
proof: By the chain rule (Theorem 2), applicable at every $x$ by assumptions 3 and 4, the inner sum in 5 is $\operatorname {pdiv}(g \circ f)\, x\, i\, k$; with 2 this is the goal.
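The wiring of the composite backward function can be sketched in plain Lean 4 (no Mathlib). All names here are hypothetical, tensors are abstracted to arbitrary types, and only the function is shown, not the correctness proof:

```lean
-- "Run Bg, feed the result to Bf", as in the definition of B above.
def compBackward {α β γ : Type} (f : α → β)
    (Bf : α → β → α) (Bg : β → γ → β) : α → γ → α :=
  fun x dy => Bf x (Bg (f x) dy)

-- Scalar instance: f x = 3x and g y = y², so (g ∘ f)' x = 18x.
def triple  (x : Float) : Float := 3.0 * x
def tripleB (_x dy : Float) : Float := 3.0 * dy      -- f' x = 3
def squareB (y dy : Float) : Float := 2.0 * y * dy   -- g' y = 2y

#eval compBackward triple tripleB squareB 2.0 1.0   -- 18 * 2 = 36
```

Note that the backward of $g$ is evaluated at the forward value $f\, x$, not at $x$; that is the one place the forward pass leaks into the backward pass.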

Theorem 11 Additive fan-in VJP

✓

Used for residual connections.

assume:

1. $B_f$ is a correct backward function for $f$ ($\mathsf{HasVJP}\, f$) [hf]
2. $B_g$ is a correct backward function for $g$ ($\mathsf{HasVJP}\, g$) [hg]
3. $f$ is differentiable everywhere [hf_diff]
4. $g$ is differentiable everywhere [hg_diff]

prove: $\mathsf{HasVJP}\, (f + g)$.

Proof.

1. Define $B(x, dy)_i := B_f(x, dy)_i + B_g(x, dy)_i$.
2. suffices: for all $x$, $dy$, $i$: $B(x, dy)_i = \sum _j \operatorname {pdiv}(f + g)\, x\, i\, j \cdot dy_j$.
proof: Definition 9, with $B$ as the candidate backward function.
3. $B(x, dy)_i = \sum _j \bigl(\operatorname {pdiv}f\, x\, i\, j + \operatorname {pdiv}g\, x\, i\, j\bigr) \cdot dy_j$.
proof: Correctness of $B_f$ and $B_g$ (assumptions 1, 2); merge the two finite sums (Finset.sum_add_distrib) and factor out $dy_j$ (ring).
4. q.e.d.
proof: The sum rule (Theorem 3), applicable at every $x$ by assumptions 3 and 4, rewrites the bracket in 3 to $\operatorname {pdiv}(f + g)\, x\, i\, j$; with 2 this is the goal.

Theorem 12 Multiplicative fan-in VJP

✓

Used for Squeeze-and-Excitation.

assume:

1. $B_f$ is a correct backward function for $f$ ($\mathsf{HasVJP}\, f$) [hf]
2. $B_g$ is a correct backward function for $g$ ($\mathsf{HasVJP}\, g$) [hg]
3. $f$ is differentiable everywhere [hf_diff]
4. $g$ is differentiable everywhere [hg_diff]

prove: $\mathsf{HasVJP}\, (f \odot g)$, where $(f \odot g)\, x\, i = f\, x\, i \cdot g\, x\, i$.

Proof.

1. Define $B(x, dy)_i := B_f\bigl(x,\; g(x) \odot dy\bigr)_i + B_g\bigl(x,\; f(x) \odot dy\bigr)_i$, where $\bigl(g(x) \odot dy\bigr)_j = g\, x\, j \cdot dy_j$.
2. suffices: for all $x$, $dy$, $i$: $B(x, dy)_i = \sum _j \operatorname {pdiv}(f \odot g)\, x\, i\, j \cdot dy_j$.
proof: Definition 9, with $B$ as the candidate backward function.
3. $B(x, dy)_i = \sum _j \bigl(\operatorname {pdiv}f\, x\, i\, j \cdot g\, x\, j + f\, x\, j \cdot \operatorname {pdiv}g\, x\, i\, j\bigr) \cdot dy_j$.
proof: Correctness of $B_f$ at upstream gradient $g(x) \odot dy$ and of $B_g$ at $f(x) \odot dy$ (assumptions 1, 2); merge the two finite sums and factor out $dy_j$ (Finset.sum_add_distrib, ring).
4. q.e.d.
proof: The product rule (Theorem 4), applicable at every $x$ by assumptions 3 and 4, identifies the bracket in 3 with $\operatorname {pdiv}(f \odot g)\, x\, i\, j$; with 2 this is the goal.
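Both fan-in constructions (Theorems 11 and 12) can be tried on scalars, where $\odot $ is ordinary multiplication. This is a plain Lean 4 sketch with hypothetical names, showing only the backward functions and not their proofs:

```lean
-- Additive fan-in: both branches receive the same upstream gradient.
def addFanIn (Bf Bg : Float → Float → Float) : Float → Float → Float :=
  fun x dy => Bf x dy + Bg x dy

-- Multiplicative fan-in: each branch receives dy scaled by the other branch's value.
def mulFanIn (f g : Float → Float) (Bf Bg : Float → Float → Float) :
    Float → Float → Float :=
  fun x dy => Bf x (g x * dy) + Bg x (f x * dy)

def sq   (x : Float) : Float := x * x
def sqB  (x dy : Float) : Float := 2.0 * x * dy    -- derivative of x² is 2x
def tri  (x : Float) : Float := 3.0 * x
def triB (_x dy : Float) : Float := 3.0 * dy       -- derivative of 3x is 3

#eval addFanIn sqB triB 2.0 1.0          -- (x² + 3x)' = 2x + 3, which is 7 at x = 2
#eval mulFanIn sq tri sqB triB 2.0 1.0   -- (3x³)' = 9x², which is 36 at x = 2
```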

Theorem 13 Identity VJP

✓

Proof.

Mechanical; see Proofs.identity_has_vjp.

1.2 Example: MNIST linear classifier

The smallest network that learns: a single dense layer mapping 784-dim images to 10-dim logits, trained with softmax cross-entropy and plain SGD. Exactly 7,850 parameters ($784 \cdot 10$ weights plus 10 biases). $\sim $92% test accuracy in seconds.

$\begin{tikzpicture} [
  >={Stealth[length=1.8mm]},
  every node/.style={font=\sffamily\scriptsize},
  col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.62cm, minimum width=4.2cm},
  io/.style = {col, draw=blue!55!black, fill=blue!8},
  head/.style = {col, draw=red!60!black, fill=red!8},
  logits/.style = {col, draw=green!50!black, fill=green!14, very thick},
  arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt},
]
% Vertical layer column: the whole "network" is one dense layer.
\node[io] (input) {Input \;\; $28\times28$ flatten $\to 784$};
\node[head, below=0.24cm of input] (d1) {\textbf{Dense} $784\to10$ \;(identity)};
\node[logits, below=0.24cm of d1] (out) {Logits \;\; 10 classes, softmax-CE};
\foreach \a/\b in {input/d1, d1/out} \draw[arr] (\a) -- (\b);
\end{tikzpicture}$

The forward pass and loss, in math.

For an input image $x \in \mathbb {R}^{784}$ and true class index $t \in \{ 0, \dots , 9\} $:

\begin{align*} z & = W x + b & & \text{(dense layer; $W \in \mathbb {R}^{10 \times 784}$, $b \in \mathbb {R}^{10}$)} \\ \hat{y}_i & = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum _j e^{z_j}} & & \text{(class probabilities)} \\ \ell (W, b;\, x, t) & = -\log \hat{y}_t & & \text{(cross-entropy on the true class)} \end{align*}

The trainable parameters are $W$ and $b$ — 7,850 floats total.
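To make the loss concrete, here is a small plain Lean 4 sketch of softmax and cross-entropy on a single logit vector. The names are hypothetical and this is not the compiled training path; like the emitted MLIR in §1.5 it exponentiates the logits directly, with no max-subtraction for numerical stability:

```lean
def softmax (z : Array Float) : Array Float :=
  let e := z.map Float.exp
  let s := e.foldl (· + ·) 0.0
  e.map (· / s)

-- Cross-entropy on the true class t, i.e. the negated log of (softmax z)_t.
def crossEntropy (z : Array Float) (t : Nat) : Float :=
  let p := softmax z
  let pt := p[t]!
  0.0 - Float.log pt

-- All-zero logits give the uniform distribution, so the loss is log 10.
#eval crossEntropy #[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] 3  -- ≈ 2.302585

-- Parameter count: 784 * 10 weights plus 10 biases.
#eval 784 * 10 + 10  -- 7850
```

The $\log 10$ value is the loss of a classifier that has learned nothing yet, which is why a freshly initialized run starts near 2.30.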

The same thing, in NetSpec.

def mnistLinear : NetSpec where
  name   := "MNIST-Linear"
  imageH := 28
  imageW := 28
  layers := [.dense 784 10 .identity]

The .identity activation means the layer outputs raw logits $z = W x + b$; softmax + cross-entropy are added by .train since this is a classification task. The four-field NetSpec above describes exactly the function written out in math just before it.

Training program and run command.

def main (args : List String) : IO Unit :=
  mnistLinear.train mnistLinearConfig (args.head?.getD "data") .mnist

In our repo this is the standalone trainer mnist-linear-train (MainMnistLinearTrain.lean), matching the per-chapter trainer pattern every later chapter uses. Build and run it with

lake build mnist-linear-train
./.lake/build/bin/mnist-linear-train

The gradient, in math.

Backpropagation through this network produces a single outer product:

\[ \nabla _W \ell = (\hat{y} - e_t) \otimes x, \qquad \nabla _b \ell = \hat{y} - e_t \]

where $e_t \in \mathbb {R}^{10}$ is the one-hot vector for the true class. This chapter’s theorems (chain rule + identity Jacobian/VJP), together with the Dense Jacobian and the softmax-CE gradient (formalized in Ch 2, both themselves proved from this chapter’s foundation rules), guarantee that these formulas are correct; linear-sgd’s codegen emits exactly these as fused MLIR.
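For readers who want the by-hand version, the formulas follow in three lines from the definitions of $z$, $\hat{y}$, and $\ell $ given earlier (an informal derivation, not the formal proof):

\begin{align*} \ell & = -z_t + \log \sum _j e^{z_j}, \\ \frac{\partial \ell }{\partial z_i} & = -\delta _{it} + \frac{e^{z_i}}{\sum _j e^{z_j}} = \hat{y}_i - (e_t)_i, \\ \frac{\partial \ell }{\partial W_{ij}} & = \frac{\partial \ell }{\partial z_i}\, \frac{\partial z_i}{\partial W_{ij}} = \bigl(\hat{y}_i - (e_t)_i\bigr)\, x_j, \qquad \frac{\partial \ell }{\partial b_i} = \hat{y}_i - (e_t)_i. \end{align*}

The middle line is the loss cotangent; the last line is the chain rule through $z = W x + b$, where $\partial z_i / \partial W_{ij} = x_j$ and $\partial z_i / \partial b_i = 1$.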

Results.

Captured verbatim from runs/2026-05-05-linear-sgd/linear-sgd.log on an AMD 7900 XTX (ROCm, gfx1100):

$ ./.lake/build/bin/mnist-linear-train
MNIST-Linear: 7850 params
training: 468 batches/epoch, batch=128, SGD, lr=0.100000
  step 0/468: loss=2.301343 (11ms)
Epoch  1/12: loss=0.381019 lr=0.100000 (932ms)
Epoch  2/12: loss=0.300518 lr=0.100000 (969ms)
Epoch  3/12: loss=0.289204 lr=0.100000 (934ms)
Epoch  4/12: loss=0.282829 lr=0.100000 (964ms)
Epoch  5/12: loss=0.277735 lr=0.100000 (995ms)
Epoch  6/12: loss=0.273606 lr=0.100000 (977ms)
Epoch  7/12: loss=0.271623 lr=0.100000 (961ms)
Epoch  8/12: loss=0.268908 lr=0.100000 (930ms)
Epoch  9/12: loss=0.265967 lr=0.100000 (969ms)
Epoch 10/12: loss=0.265967 lr=0.100000 (969ms)
  val accuracy: 9219/9984 = 92.34%
Epoch 11/12: loss=0.265719 lr=0.100000 (949ms)
Epoch 12/12: loss=0.263504 lr=0.100000 (967ms)
  val accuracy: 9175/9984 = 91.90%
Saved params.

Twelve epochs, about 1 second per epoch, $\sim $12 seconds total. Loss drops from $\log 10 \approx 2.30$ at step 0 (random init for 10 classes) to $0.26$ by epoch 12. Final validation accuracy is 91.90% (down slightly from 92.34% at the epoch-10 check), which is the going rate for a linear classifier on MNIST (LeCun’s original 1998 benchmark put a comparable model at $\sim $92%). Adding hidden layers and ReLU (Ch 2) bumps this to 98.57% on the same dataset and the same recipe; that delta of $\sim $6.7 points is the value of non-linearity on this task.
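The batch counts in the log can be reproduced with integer arithmetic, assuming the standard MNIST split (60,000 training and 10,000 test images) and truncating division by the batch size, as in the nTrain / batchN of the training loop in §1.4:

```lean
#eval 60000 / 128             -- 468 full training batches per epoch
#eval 10000 / 128 * 128       -- 78 full batches = 9984 scored validation images
#eval 9175.0 / 9984.0 * 100.0 -- final accuracy, ≈ 91.90
#eval 98.57 - 91.90           -- gap to the Ch 2 model, ≈ 6.67 points
```

The 9984 in the log's denominator matches 78 full batches of 128; the remaining 16 images are apparently not scored.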

Where this goes.

Every architecture in Part 2 extends this template the same way. Ch 2 stacks dense layers and inserts ReLU between them (one new operator VJP); Ch 3 swaps the dense forward for conv2d (one new operator VJP); Ch 4 adds BatchNorm (one new operator VJP); Ch 5 adds the residual skip (additive fan-in, Theorem 11, no new operator); Ch 6 factors standard conv into depthwise plus pointwise (one new operator VJP, recombined through the chain rule); Ch 7 adds the SE channel-attention block (elementwise product plus GAP plus dense, one new VJP composed from pieces we already have); Ch 8 replaces ReLU with GELU (one new operator); Ch 9 adds attention (new matrix-level machinery plus softmax-VJP chains). The structural rules from this chapter never change; later chapters add operator-specific theorems and compose them through the chain rule we just proved. The training loop that runs all of them is the same hundred lines of Lean, walked through next.

1.3 MLIR: Linear

The backward pass of the linear classifier is one operation: with the loss cotangent $dy = \mathrm{softmax}(\mathrm{logits}) - \text{onehot}$, the input gradient is $dx = dy\, W^{\! \top }$, a single matrix multiply. Here is exactly what the code generator emits for it (at a small representative size; nothing hand-written, one stablehlo op):

func.func @linear_back(%dy: tensor<2x3xf32>, %W0: tensor<4x3xf32>)
    -> tensor<2x4xf32> {
  %bk0 = stablehlo.dot_general %dy, %W0, contracting_dims = [1] x [1]
           : (tensor<2x3xf32>, tensor<4x3xf32>) -> tensor<2x4xf32>
  return %bk0 : tensor<2x4xf32>
}

That dot_general, contracting the output axis of $dy$ against the output axis of $W$, is multiplication by $W^{\! \top }$ — which is, by a machine-checked theorem, the linear layer’s exact reverse-mode derivative. Everything else in this book scales that one move up. Each architecture chapter closes with an “MLIR: operator” section that takes the chapter’s new operator — convolution, BatchNorm, the residual fan-in, depthwise convolution, squeeze-excitation, layer scale, attention — and shows that its emitted backward is likewise the rendering of a proven derivative. Each of those listings is shown at a small representative size — a few channels, tokens, or units of width. The per-operator proof beneath is dimension-parameterized and the printer emits the identical graph at production scale, so each section just states its toy shape and moves on. The shared machinery underneath them — the denoted intermediate representation, the bridge theorems that tie it to the proofs, the printer that turns it into the text above, and the execution oracle that checks the result on the GPU — is the subject of Appendix C.
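Written with indices (a restatement of the listing, with $b$ ranging over the batch of 2, $i$ over the 4 inputs, and $o$ over the 3 outputs), the contraction is

\[ dx_{b i} = \sum _{o} dy_{b o}\, W_{i o} = (dy\, W^{\! \top })_{b i}. \]

The listing stores $W$ as inputs $\times $ outputs ($4 \times 3$), so the matching forward pass would be $x W$; §1.2 wrote the same layer as $W x$ with $W \in \mathbb {R}^{10 \times 784}$, which is the transposed layout of the same weights.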

1.4 What’s inside .train?

Section 1.2 treated mnistLinear.train as a black box. You specified the network, you specified the hyperparameters, training happened. That’s deliberately the user-facing interface, but there’s no magic: the training loop is real code, $\sim $100 lines of Lean in LeanMlir/Train.lean. Every chapter in this book uses the same loop. The only things that vary chapter to chapter are the NetSpec and TrainConfig values handed to it, never the loop itself.

The training loop, the way a math book would write it.

Algorithm: Mini-batch SGD with Adam optimizer
Input:  spec (architecture f_θ), cfg (hyperparameters),
        dataset D = (X_train, y_train, X_val, y_val)
Output: trained parameters θ

// 1. Load training data
(X_train, y_train) ← load(D)
B ← cfg.batchSize, E ← cfg.epochs, α ← cfg.learningRate

// 2. Initialize parameters + Adam moment buffers
θ ← spec.heInit()
m, v ← 0                                  // Adam 1st and 2nd moment

// 3. Epoch loop: shuffle, schedule LR, train, log
for epoch = 1 to E:
    (X, y) ← shuffle(X_train, y_train)
    α_t ← schedule(α, epoch)              // cosine + warmup

    // 4. Batch loop: forward + loss + backward + optimizer
    for each mini-batch (x, t) of size B:
        ŷ ← f_θ(x)                         // forward
        L ← ℓ(ŷ, t)                        // loss
        g ← ∇_θ L                          // backward — Ch 1's VJP machinery
        (θ, m, v) ← Adam(θ, m, v, g, α_t)  // optimizer step
    log(epoch, mean_loss)

    // 5. Validation every 10 epochs, and after the final epoch
    if epoch ≡ 0 mod 10 or epoch = E:
        eval(θ, X_val, y_val)

// 6. Save trained parameters
save(θ)
return θ

The only line that needs proof machinery is $g \leftarrow \nabla _\theta L$ — that’s exactly what Ch 1’s theorems just established. Everything else is data plumbing (loaders, epoch counters, the SGD/Adam update). For mnistLinear above, that gradient is a single outer product; for ResNet-34 it would be the chain rule applied through 34 layers; the loop is the same in either case. Each later chapter swaps in a different $f_\theta $; the loop never changes.
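For reference, the optimizer line of the pseudocode can be sketched on a single scalar parameter. This is the textbook Adam update with the commonly used defaults $\beta _1 = 0.9$, $\beta _2 = 0.999$, $\varepsilon = 10^{-8}$; those constants and the names AdamState and adamStep are assumptions, since this chapter does not list what the compiled step uses:

```lean
structure AdamState where
  theta : Float   -- parameter
  m     : Float   -- first-moment estimate
  v     : Float   -- second-moment estimate

-- One Adam update; `step` counts from 1 so the bias correction is well defined.
def adamStep (s : AdamState) (g lr : Float) (step : Nat)
    (beta1 : Float := 0.9) (beta2 : Float := 0.999) (eps : Float := 1e-8) :
    AdamState :=
  let m := beta1 * s.m + (1.0 - beta1) * g
  let v := beta2 * s.v + (1.0 - beta2) * g * g
  let mHat := m / (1.0 - Float.pow beta1 step.toFloat)   -- bias-corrected moments
  let vHat := v / (1.0 - Float.pow beta2 step.toFloat)
  { theta := s.theta - lr * mHat / (Float.sqrt vHat + eps), m := m, v := v }

-- On the first step the bias correction makes the move about lr * sign g.
#eval (adamStep { theta := 1.0, m := 0.0, v := 0.0 } 0.5 0.1 1).theta  -- ≈ 0.9
```

Nothing in this update needs proof machinery: it consumes the gradient $g$ as an opaque input, which is why the chapter's theorems concentrate on the backward pass.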

The same loop, in Lean.

The Lean realization of the algorithm above is one hundred lines in LeanMlir/Train.lean. The // 1 through // 6 comments below match the same labels in the pseudocode — read them in parallel.

def runTraining (spec : NetSpec) (cfg : TrainConfig)
    (ds : DatasetKind) (dataDir : String)
    (sess : IreeSession) : IO Unit := do
  -- 1. Load training data
  let batchN := cfg.batchSize
  let dio    := datasetIO ds
  let (trainImg, trainLbl, nTrain) ← dio.loadTrain dataDir

  -- 2. Initialize parameters + Adam moment buffers
  let mut p ← spec.heInitParams                        -- He-init weights
  let mut m ← F32.const (F32.size p).toUSize 0.0        -- Adam 1st moment
  let mut v ← F32.const (F32.size p).toUSize 0.0        -- Adam 2nd moment

  let bpE := nTrain / batchN
  let nP  := spec.totalParams
  let mut globalStep : Nat := 0

  -- 3. Epoch loop: shuffle, schedule LR, train, log
  for epoch in [:cfg.epochs] do
    let (sImg, sLbl) ← F32.shuffle trainImg trainLbl
                         nTrain.toUSize dio.trainPixels.toUSize
                         (epoch + 42).toUSize

    let lr : Float :=                                   -- cosine + warmup
      if epoch < cfg.warmupEpochs then
        cfg.learningRate * (epoch.toFloat + 1.0)
          / cfg.warmupEpochs.toFloat
      else if cfg.cosineDecay then
        cfg.learningRate * 0.5 * (1.0 + Float.cos (
          3.14159265 * (epoch.toFloat - cfg.warmupEpochs.toFloat)
                     / (cfg.epochs.toFloat - cfg.warmupEpochs.toFloat)))
      else cfg.learningRate

    -- 4. Batch loop: forward + loss + backward + optimizer in ONE call
    let mut epochLoss : Float := 0.0
    for bi in [:bpE] do
      globalStep := globalStep + 1
      let xba := F32.sliceImages sImg (bi * batchN) batchN dio.trainPixels
      let yb  := F32.sliceLabels sLbl (bi * batchN) batchN
      let packed := (p.append m).append v

      let out ← IreeSession.trainStepAdamF32 sess spec.trainFnName
                  packed spec.shapesBA xba (spec.xShape batchN) yb
                  lr globalStep.toFloat spec.bnShapesBA batchN.toUSize

      epochLoss := epochLoss + F32.extractLoss out (3 * nP)
      p := F32.slice out 0           nP                 -- updated params
      m := F32.slice out nP          nP                 -- updated m
      v := F32.slice out (2 * nP)    nP                 -- updated v

    IO.eprintln s!"Epoch {epoch+1}/{cfg.epochs}: " ++
                s!"loss={epochLoss / bpE.toFloat} lr={lr}"

    -- 5. Validation every 10 epochs: forward-only vmfb over val set
    if (epoch + 1) % 10 == 0 || epoch + 1 == cfg.epochs then
      let evalSess ← IreeSession.create
                       s!"{spec.buildPrefix}_fwd_eval.vmfb"
      let (valImg, valLbl, nVal) ← dio.loadVal dataDir
      let mut correct : Nat := 0
      for bi in [:nVal / batchN] do
        let xba := F32.sliceImages valImg (bi * batchN) batchN dio.valPixels
        let logits ← IreeSession.forwardF32 evalSess spec.evalFnName
                       p spec.evalShapesBA xba (spec.xShape batchN)
                       batchN.toUSize spec.numClasses.toUSize
        for i in [:batchN] do
          let pred  := F32.argmax10 logits (i * spec.numClasses).toUSize
          let label := (F32.sliceLabels valLbl (bi * batchN) batchN)
                         .data[i * 4]!.toNat
          if pred.toNat == label then correct := correct + 1
      let acc := correct.toFloat / nVal.toFloat * 100.0
      IO.eprintln s!"  val accuracy: {correct}/{nVal} = {acc}%"

  -- 6. Save trained parameters
  IO.FS.writeBinFile s!"{spec.buildPrefix}_params.bin" p
  IO.eprintln "Saved params."

Walking through the numbered sections:

1. Load training data. datasetIO ds returns a per-dataset I/O helper (MNIST, CIFAR-10, Imagenette) that knows how to mmap the on-disk binary files into F32Array buffers. trainImg holds the flattened pixel data; trainLbl holds integer class labels. Nothing ML-specific yet — this is just “put the data somewhere the GPU can reach.”

2. Initialize parameters and optimizer state. He initialization (spec.heInitParams) produces a random-but-scaled weight vector. Adam’s first and second moment buffers start at zero. Everything is a flat F32Array; reshape logic happens per-layer inside the compiled vmfb, not here. Three buffers — p, m, v — plus a running step counter.

3. Epoch loop. Shuffle the data once per epoch with a deterministic seed (epoch + 42) so runs are reproducible. Compute the learning rate: linear warmup for the first warmupEpochs, then cosine decay (if enabled) over the rest of training. If neither warmup nor cosine is on (the s4tfBaseline case we used in the first example), lr is simply the constant cfg.learningRate. Warmup handles the transformer-style fragile-early-gradient problem; cosine is the standard don’t-overfit-at-the-end schedule.

4. Batch loop — the core of training. For each mini-batch: slice the current epoch’s images and labels, concatenate (params, m, v) into one flat buffer, and call IreeSession.trainStepAdamF32. That one call does everything: forward pass through every layer, cross-entropy loss, backward pass (the VJP of every layer you’ve proved in earlier sections), Adam update of parameters and moment buffers. Whether that update is Adam or plain SGD + momentum is baked into the vmfb at codegen time by cfg.useAdam (step 1); the call name stays trainStepAdamF32 either way. All of it executes as one pre-compiled stablehlo vmfb on the GPU. The framework unpacks the updated (p, m, v) from the output buffer and loops.

The entire Lean $\to $ MLIR $\to $ IREE pipeline’s value proposition lives in this one line. There is no Python. There is no graph construction per-step. There is no autograd interpreter. The training step was compiled once, at startup, and every subsequent step is a single dispatched vmfb call. That’s why training at batch 128 runs at $\sim $20 ms per step on a consumer GPU (7900 XTX), $\sim $80 ms on a 4060 Ti.

5. Validation. Every 10 epochs (and always at the end) we swap in the forward-only eval vmfb, which uses BN’s running statistics (not the per-batch estimates training uses). Loop over the val set, forward-pass each batch, argmax the logits, compare to labels, count correct. Print accuracy. The eval vmfb is a separate compiled artifact because BN behaves differently at inference and we don’t want that branching inside the hot training loop.

6. Save. Write the parameter buffer to disk as a raw .bin blob so it can be loaded later for inference, fine-tuning, or cross-run comparison.
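The learning-rate schedule from step 3 is easy to study in isolation. Below it is pulled out as a pure function in plain Lean 4; the name lrAt and its argument order are hypothetical, and the arithmetic mirrors the listing above:

```lean
def lrAt (base : Float) (warmup epochs : Nat) (cosine : Bool) (epoch : Nat) : Float :=
  if epoch < warmup then
    -- linear warmup: base/warmup, 2*base/warmup, ..., base
    base * (epoch.toFloat + 1.0) / warmup.toFloat
  else if cosine then
    -- half-cosine decay over the epochs that remain after warmup
    let progress := (epoch.toFloat - warmup.toFloat) / (epochs.toFloat - warmup.toFloat)
    base * 0.5 * (1.0 + Float.cos (3.14159265 * progress))
  else
    base

#eval lrAt 0.1 2 12 true 0    -- warmup: 0.1 * 1/2 = 0.05
#eval lrAt 0.1 2 12 true 2    -- decay starts at the full rate, 0.1
#eval lrAt 0.1 2 12 true 7    -- halfway through the decay, about 0.05
#eval lrAt 0.1 0 12 false 5   -- no warmup, no cosine: constant 0.1
```

Because epoch is zero-based, progress tops out at $(E - 1 - w)/(E - w)$ rather than 1, so the last epoch still trains at a small positive rate instead of exactly zero.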

That’s the whole function. Seventy-odd lines of Lean as listed here ($\sim $100 in the full file), doing what most Python deep-learning tutorials present as an elaborate black box. The reason it’s short is that everything heavyweight (forward pass, loss, backward pass, optimizer update) is one compiled vmfb executing on the GPU. Lean’s role here isn’t to compute gradients; it’s to specify what the training step is, prove that specification is mathematically correct (the theorems earlier in this chapter and in subsequent chapters), and invoke the IREE-compiled implementation. Three separate concerns, cleanly factored.

Every chapter after this one hands a different NetSpec to this same loop. Only the NetSpec changes; the loop is literally the same hundred lines each time. That’s the framework pitch in concrete form: once the loop is written (and proved correct one layer at a time), every architecture in the book and every architecture in the bestiary runs through it with no further infrastructure code. And it all runs on your hardware: Appendix B walks you through the toolchain setup, and if you’re impatient, there’s a Docker image that gets you from zero to training MNIST in one command.

1.5 MLIR: Training Step

Section 1.4 made a claim about the inner call: one dispatched stablehlo vmfb does the forward pass, the loss, the backward pass, and the optimizer update — no tape, no per-step graph construction. The “MLIR: Linear” section above rendered a single operator’s backward in isolation; here is the whole thing the linear classifier compiles to. It is short enough to show essentially in full — only the default precision attributes are dropped and the long type signatures wrapped for the page:

func.func @linear_train_step(
    %x: tensor<128x784xf32>,  %W0: tensor<784x10xf32>,
    %b0: tensor<10xf32>,      %onehot: tensor<128x10xf32>)
    -> (tensor<784x10xf32>, tensor<10xf32>) {
  // -- forward + softmax-CE cotangent (rendered from lossCotGraph) --
  %v0 = stablehlo.dot_general %x, %W0, contracting_dims = [1] x [0]
          : (tensor<128x784xf32>, tensor<784x10xf32>) -> tensor<128x10xf32>
  %v1 = stablehlo.broadcast_in_dim %b0, dims = [1]
          : (tensor<10xf32>) -> tensor<128x10xf32>
  %v2 = stablehlo.add %v0, %v1 : tensor<128x10xf32>          // logits
  %v3 = stablehlo.exponential %v2 : tensor<128x10xf32>
  %v4 = stablehlo.constant dense<0.0> : tensor<f32>
  %v5 = stablehlo.reduce(%v3 init: %v4)
          applies stablehlo.add across dimensions = [1]
          : (tensor<128x10xf32>, tensor<f32>) -> tensor<128xf32>
  %v6 = stablehlo.broadcast_in_dim %v5, dims = [0]
          : (tensor<128xf32>) -> tensor<128x10xf32>
  %v7 = stablehlo.divide %v3, %v6 : tensor<128x10xf32>       // softmax
  %v8 = stablehlo.subtract %v7, %onehot : tensor<128x10xf32> // dy
  // -- param grads:  dW0 = x^T . dy,  db0 = sum_batch dy --
  %sc  = stablehlo.constant dense<0.0> : tensor<f32>
  %dW0 = stablehlo.dot_general %x, %v8, contracting_dims = [0] x [0]
           : (tensor<128x784xf32>, tensor<128x10xf32>) -> tensor<784x10xf32>
  %db0 = stablehlo.reduce(%v8 init: %sc)
           applies stablehlo.add across dimensions = [0]
           : (tensor<128x10xf32>, tensor<f32>) -> tensor<10xf32>
  // -- SGD update:  theta' = theta - lr*grad  (0.00078125 = 0.1 / 128) --
  %lW0 = stablehlo.constant dense<0.00078125> : tensor<784x10xf32>
  %sW0 = stablehlo.multiply %dW0, %lW0 : tensor<784x10xf32>
  %W0n = stablehlo.subtract %W0, %sW0 : tensor<784x10xf32>
  %lb0 = stablehlo.constant dense<0.00078125> : tensor<10xf32>
  %sb0 = stablehlo.multiply %db0, %lb0 : tensor<10xf32>
  %b0n = stablehlo.subtract %b0, %sb0 : tensor<10xf32>
  return %W0n, %b0n : tensor<784x10xf32>, tensor<10xf32>
}

One function, one straight line of dataflow, no control flow: the signature takes the parameters and a batch of data and returns the updated parameters, $(\theta ,\, \text{data}) \mapsto \theta '$. Read it in three bands. The first (%v0–%v8) fuses the forward pass and the loss cotangent: dot_general for the logits, the exp/reduce/divide softmax, and the subtract %onehot that yields $dy = \partial \mathrm{CE}/\partial \text{logits}$. That whole band is the rendering of a single verified graph, lossCotGraph, certified equal to the softmax–cross-entropy gradient by lossCotGraph_isCEgrad. The second band is the backward pass: the dot_general %x, %v8 is exactly the move from the “MLIR: Linear” section, $x^{\! \top } dy$, now serving as the weight gradient, and the reduce sums $dy$ over the batch for the bias gradient. The third band is the optimizer: scale each gradient by the learning rate and subtract. There is no fourth band: no copy-back, no separate update kernel. Forward, loss, backward, and step are one compiled function, which is exactly why § 1.4’s loop can treat a training step as a single dispatch.
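To make the bands concrete, here is a minimal pure-Python sketch of the same dataflow on tiny nested lists. It is not the book's Lean or IREE code: the helper names (softmax_rows, forward, train_step, summed_ce) and the 2x3 toy batch are assumptions chosen for illustration, and summed_ce exists only so the gradients can be cross-checked by finite differences.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax, written like the listing: exp, row sum, divide."""
    exps = [[math.exp(v) for v in row] for row in logits]
    return [[e / sum(row) for e in row] for row in exps]

def forward(x, W, b):
    """logits[i][c] = sum_j x[i][j] * W[j][c] + b[c]."""
    return [[sum(xi[j] * W[j][c] for j in range(len(W))) + b[c]
             for c in range(len(b))] for xi in x]

def train_step(x, W, b, onehot, lr):
    """Hypothetical reference for the listing's three bands."""
    n, d, k = len(x), len(W), len(b)
    # Band 1: forward pass and loss cotangent dy = softmax(logits) - onehot.
    probs = softmax_rows(forward(x, W, b))
    dy = [[probs[i][c] - onehot[i][c] for c in range(k)] for i in range(n)]
    # Band 2: dW = x^T dy and db = sum of dy over the batch (sums, not means).
    dW = [[sum(x[i][j] * dy[i][c] for i in range(n)) for c in range(k)]
          for j in range(d)]
    db = [sum(dy[i][c] for i in range(n)) for c in range(k)]
    # Band 3: SGD with the per-sample rate lr / n folded into one constant.
    rate = lr / n
    W_new = [[W[j][c] - rate * dW[j][c] for c in range(k)] for j in range(d)]
    b_new = [b[c] - rate * db[c] for c in range(k)]
    return W_new, b_new, dW, db

def summed_ce(x, W, b, onehot):
    """Batch-summed cross-entropy, used only to cross-check the gradients."""
    probs = softmax_rows(forward(x, W, b))
    return -sum(t * math.log(p)
                for prow, trow in zip(probs, onehot)
                for p, t in zip(prow, trow))

# Toy batch: 2 samples, 3 features, 2 classes (the listing uses 128, 784, 10).
x = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]
W = [[0.1, -0.2], [0.3, 0.0], [-0.1, 0.2]]
b = [0.0, 0.1]
onehot = [[1.0, 0.0], [0.0, 1.0]]
W_new, b_new, dW, db = train_step(x, W, b, onehot, lr=0.1)
```

Because dW and db are batch sums, they should match the derivative of the summed (not averaged) cross-entropy; a central finite difference on any single weight is a quick way to confirm that on the toy data.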

Two honest notes. First, this is SGD, not Adam. The mnistLinear example runs the s4tfBaseline configuration (plain $\theta ' = \theta - \alpha \nabla $ at a constant learning rate), so the update band is three ops per tensor. Turning on cfg.useAdam grows that band (the first- and second-moment buffers enter and leave the signature and an rsqrt appears) but changes nothing above it; the deeper chapters’ steps carry that Adam tail, the early ones do not. Second, the learning rate is folded in. The weight-gradient dot_general contracts the batch axis, so $dW_0$ is a batch sum, not a mean, and the bias reduce does the same for $db_0$; the constant 0.00078125 is $\alpha /N = 0.1/128$, the per-sample rate that turns that sum back into the mean update. Everything else in the listing is, line for line, the rendering of a theorem proved earlier in this chapter.
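The folded learning rate can be written out as a short worked equation. The per-sample loss $\ell _i$ is notation assumed for this sketch; $N = 128$ and $\alpha = 0.1$ are the values from the listing:

$\theta ' \;=\; \theta - \frac{\alpha }{N} \sum _{i=1}^{N} \nabla _{\theta }\, \ell _i \;=\; \theta - \alpha \, \nabla _{\theta } \Big( \frac{1}{N} \sum _{i=1}^{N} \ell _i \Big), \qquad \frac{\alpha }{N} = \frac{0.1}{128} = 0.00078125.$

The sum over $i$ is what the dot_general and the reduce produce; the constant supplies the missing $1/N$, so the step taken is the ordinary mean-loss SGD step.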

That one function is the only heavyweight piece of the loop that drives it. Stripped of logging and the learning-rate schedule, the whole of § 1.4’s runTraining is six lines:

p := heInit(spec)                      // params (+ zeroed Adam buffers)
for epoch in 0..E:
  for batch in shuffle(data):
    p := train_step_vmfb(p, batch, lr) // <- the function above
  if epoch % 10 == 0: eval(p)          // forward-only vmfb
save(p)

Six lines of host control flow around one compiled dispatch. The loop shuffles, schedules the learning rate, and counts epochs — it never computes a gradient or touches a tensor. The lone train_step_vmfb call is the function above, and it is the only place arithmetic happens.
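The same shape can be sketched as runnable Python. This is an illustration of the control flow only, not the book's runTraining: run_training, toy_step, and toy_eval are hypothetical names, the toy step fits a single scalar by least squares in place of the compiled dispatch, and parameter initialization and saving are left to the caller.

```python
import random

def run_training(params, data, train_step, evaluate,
                 epochs, batch_size, lr, seed=0):
    """Host-side loop: shuffle, batch, count epochs; arithmetic lives in train_step."""
    rng = random.Random(seed)
    history = []
    for epoch in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            params = train_step(params, batch, lr)     # stand-in for the compiled step
        if epoch % 10 == 0:
            history.append((epoch, evaluate(params)))  # stand-in for the forward-only eval
    return params, history

# Toy stand-ins (assumptions): fit a scalar w to the data by least squares.
data = [1.0, 2.0, 3.0, 4.0]

def toy_step(w, batch, lr):
    grad = sum(w - y for y in batch) / len(batch)      # mean gradient of 0.5 * (w - y)^2
    return w - lr * grad

def toy_eval(w):
    return sum((w - y) ** 2 for y in data) / (2 * len(data))

w, history = run_training(0.0, data, toy_step, toy_eval,
                          epochs=30, batch_size=4, lr=0.5)
```

The loop body never inspects params or batch; swapping toy_step for a real compiled step would leave run_training untouched, which is the factoring the text describes.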

Where this stands. The scheme above covers more than the linear classifier: the shallow chapters’ train steps (MLP, CNN, CIFAR$\pm $BN) are closed both ways, and all five Part I architectures — ResNet-34, MobileNetV2, EfficientNet-B0, ConvNeXt-T, ViT — are proved at full architecture: machine-checked forward graphs throughout; whole-network backward at full depth, with the non-triviality seal (that the proven backward is nonzero at a witness) at full depth for all five; three classical axioms. All five committed training steps were then re-rendered at the production trainers’ exact signatures and tied: every parameter update is proved to consume the cotangent its own backward pass delivers, threaded through the real forward, and each agrees on GPU to float rounding. The one open gap: the theorems are over $\mathbb {R}$, the GPU runs float32. Part II’s bestiary is deliberately lighter — decomposed into the same proved primitives, not carried to whole-step closes. The book’s official claim is exactly this, no more and no less: every architecture in Part I trained end-to-end under one machine-checked scheme.