2 MNIST: 1D MLP

The pdiv we built last chapter, now applied

In Chapter 1 we defined a function $\operatorname {pdiv}$ that captures the partial derivative of any sufficiently smooth function. We proved that three structural rules—chain, sum, product—suffice to compose new partials out of old ones. That was machinery without a target. This chapter picks the target: we’re going to compute the partial derivative of every component of a small image classifier, end-to-end, and the goal is for that classifier to come out the other side able to recognize handwritten digits.

To say the same thing more concretely: a neural network is a long chain of functions, and “training” means using the partials of those functions to nudge their parameters in a direction that reduces a loss. Every theorem in this chapter is the answer to the same question, asked about a different building block: if I jiggle this input by $\varepsilon $, how does the output jiggle?

The Jacobian as multidimensional “how does it jiggle?”

For a scalar function $f : \mathbb {R} \to \mathbb {R}$ the answer is a single number: the derivative. For our networks, no function is scalar-in scalar-out. The smallest layer maps an input vector to an output vector—$n$ numbers in, $m$ numbers out. “Jiggle the input” now means: pick a direction in $\mathbb {R}^n$ and push slightly along it. “The output jiggles” now means: a vector in $\mathbb {R}^m$. The full picture of how every output coordinate responds to every input coordinate is an $m \times n$ table of numbers, one slope per (output, input) pair. That table is the Jacobian. There is nothing more to it. It is just the multidimensional generalization of “the derivative is the slope.”

The previous chapter’s $\operatorname {pdiv}$ is the one entry of this table at a chosen $(i, j)$. The Jacobian is $\operatorname {pdiv}$ applied across both indices at once, organized so it can be multiplied into other matrices later.

The dense layer’s Jacobian is the weight matrix

Our smallest building block is a dense (or “fully connected”) layer: $y = Wx + b$, where $W$ is an $m \times n$ weight matrix, $x$ is an $n$-vector, and $b$ is an $m$-vector. Pick any output coordinate $y_j$ and write it out:

\[ y_j = \sum _{k=1}^{n} W_{jk}\, x_k + b_j. \]

Now jiggle $x_i$ by $\varepsilon $. Of the $n$ terms in the sum, only the $k = i$ term notices, and it changes by $W_{ji}\, \varepsilon $. So $\partial y_j / \partial x_i = W_{ji}$. The Jacobian of $y = Wx + b$ with respect to $x$ is the weight matrix $W$ itself. No new structure, no surprise—the Jacobian of a linear map is the linear map, written as a matrix.

We will do the same exercise for the partial with respect to $W$, and we will get an analogous answer: the dependence is local, the entries of the Jacobian are just $x_{i'}\, \delta _{jj'}$. These two—input-Jacobian and weight-Jacobian—are the only objects we need for a dense layer. Theorems 14 and 15 below formalize them; we have already done the substantive work.

ReLU: the piecewise case

ReLU is the function $\mathrm{relu}(x) = \max (x, 0)$ applied coordinatewise. Its Jacobian is a diagonal matrix: each output coordinate depends only on the matching input coordinate. The diagonal entry is $1$ where $x_i {\gt} 0$ and $0$ where $x_i {\lt} 0$. At $x_i = 0$ the function is not differentiable in the classical sense—the slope jumps from $0$ to $1$. For our purposes this matters less than you might fear. Theorem 16 states the Jacobian at smooth points; the codegen substitutes the standard subgradient convention at the kink and we move on. The kink is the only place in the chapter where “differentiable” becomes slightly subtle.

Softmax cross-entropy: where vectors collapse to a scalar

The last building block is the loss: softmax cross-entropy between the model’s output $z \in \mathbb {R}^{10}$ and the true class label $y \in \{ 0, \ldots , 9\} $. The full computation is

\[ L(z, y) = -\log \frac{e^{z_y}}{\sum _k e^{z_k}}. \]

The loss is a scalar, so its Jacobian with respect to $z$ is a vector, not a matrix. Working through the algebra (chain rule on $-\log \circ \mathrm{softmax}$ at the label index) produces a remarkably clean answer:

\[ \frac{\partial L}{\partial z} = \mathrm{softmax}(z) - \mathrm{onehot}(y). \]

The gradient of the loss is the difference between the model’s predicted distribution and the truth. Theorem 17 makes this formal.

From Jacobians to VJPs

Training does not multiply Jacobians together directly. The loss is a scalar, and the quantity we actually want is “how does each parameter affect that scalar?” That quantity is the Jacobian of the loss with respect to the parameters transposed and applied to the upstream gradient—a vector-Jacobian product, or VJP. Concretely: the upstream gradient is a vector $dy$, and we want to compute the corresponding $dx$ and $dW$. For a dense layer the answers fall out by transposing the picture we already have:

\[ dx = W^\top dy, \qquad dW = dy \otimes x, \qquad db = dy. \]

The remaining theorems in this chapter (18 through 22) are the formalizations of these identities, plus the proof that VJPs of composed layers are themselves VJPs obtained by composing the building blocks in reverse order. That last claim is what makes a 3-layer MLP’s backward pass exactly three transposed matrix multiplies, and it is the structural fact that the rest of the book leans on.

2.1 The theorems

Theorem 14 Dense Jacobian wrt input

✓

For the dense layer $\mathrm{dense}(W, b)\, x = \lambda j.\; \bigl(\sum _i x_i\, W_{ij}\bigr) + b_j$:

\[ \operatorname {pdiv}\bigl(\mathrm{dense}(W, b)\bigr)\, x\, i\, j = W_{ij}. \]

No hypotheses: dense is affine, so the differentiability obligations are discharged inside the proof rather than assumed.

Proof ▶

Sketch: factor the layer as (finite sum of bilinear summands) + constant, distribute $\operatorname {pdiv}$, apply the product rule per summand, collapse the Kronecker $\delta $. Every foundation rule from Chapter 1 except the chain rule fires exactly once.

$\operatorname {pdiv}\bigl(\mathrm{dense}(W, b)\bigr)\, x\, i\, j = \operatorname {pdiv}\bigl(\lambda y\, j'.\, \textstyle \sum _{i'} y_{i'} W_{i'j'}\bigr)\, x\, i\, j$.
proof: Split the layer as $\bigl(\sum _{i'} \cdots \bigr) + (\text{const } b)$ and apply the sum rule (Theorem 3) and the constant rule (Theorem 6). Their differentiability hypotheses hold because each summand $y \mapsto y_{i'} \cdot W_{i'j'}$ is (reindex) $\times $ (constant) — differentiable since reindexing is a continuous linear map — and the finite sum inherits differentiability (DifferentiableAt.fun_sum).
$\mathord {@}= \sum _{i'} \operatorname {pdiv}\bigl(\lambda y\, j'.\, y_{i'} \cdot W_{i'j'}\bigr)\, x\, i\, j$.
proof: Finite-sum rule (Theorem 8), with the same per-summand differentiability.
For each $i'$: $\operatorname {pdiv}\bigl(\lambda y\, j'.\, y_{i'} \cdot W_{i'j'}\bigr)\, x\, i\, j = \delta _{i i'} \cdot W_{i'j}$.
proof: Product rule (Theorem 4) on (reindex) $\times $ (constant); the reindex Jacobian (Theorem 7) with $\sigma = \lambda \_ .\, i'$ contributes the $\delta $, the constant factor’s Jacobian vanishes (Theorem 6); case split on $i = i'$.
q.e.d.
proof: Substitute 3 into 2: $\sum _{i'} \delta _{i i'} W_{i'j} = W_{ij}$ (Finset.sum_ite_eq); with 1 this is the goal.

Theorem 15 Dense Jacobian wrt weight

✓

Phase 7; the symmetric counterpart of Theorem 14, differentiating in $W$ instead of $x$. Since $\operatorname {pdiv}$ works on vectors, view the layer as a function of the flattened weights: for $v \in \mathbb {R}^{m \cdot n}$, let $F(v) := \mathrm{dense}(\mathrm{unflatten}\, v,\, b)\, x$, and write $\varphi $ for the index bijection $(i, j) \leftrightarrow \varphi (i, j)$ (finProdFinEquiv). prove: for all $i, j', j$:

\[ \operatorname {pdiv}F\, (\mathrm{flatten}\, W)\, \varphi (i, j')\, j = \delta _{jj'}\, x_i. \]

Proof ▶

Sketch: same skeleton as Theorem 14 — split, distribute, product rule, collapse — with the reindex step now going through the flatten bijection.

$F(v)_{j_o} = \bigl(\sum _{i'} x_{i'} \cdot v_{\varphi (i', j_o)}\bigr) + b_{j_o}$.
proof: Unfold $\mathrm{dense}$ and $\mathrm{unflatten}$.
$\operatorname {pdiv}F\, (\mathrm{flatten}\, W)\, \varphi (i, j')\, j = \operatorname {pdiv}\bigl(\lambda w\, j_o.\, \textstyle \sum _{i'} x_{i'} \cdot w_{\varphi (i', j_o)}\bigr)\, (\mathrm{flatten}\, W)\, \varphi (i, j')\, j$.
proof: Drop the constant bias: sum rule (Theorem 3) and constant rule (Theorem 6); each summand is (constant) $\times $ (reindex), differentiable, and the finite sum inherits differentiability (DifferentiableAt.fun_sum).
$\mathord {@}= \sum _{i'} \operatorname {pdiv}\bigl(\lambda w\, j_o.\, x_{i'} \cdot w_{\varphi (i', j_o)}\bigr)\, (\mathrm{flatten}\, W)\, \varphi (i, j')\, j$.
proof: Finite-sum rule (Theorem 8).
For each $i'$: the summand equals $\text{if } i = i' \wedge j' = j \text{ then } x_i \text{ else } 0$.
proof: Product rule (Theorem 4) on (constant $x_{i'}$) $\times $ (reindex $\sigma = \lambda j_o.\, \varphi (i', j_o)$); the constant factor’s Jacobian vanishes (Theorem 6), and the reindex Jacobian (Theorem 7) is $1$ exactly when $\varphi (i, j') = \varphi (i', j)$, which by injectivity of $\varphi $ is $i = i' \wedge j' = j$.
q.e.d.
proof: Sum 4 over $i'$: if $j' = j$ the Kronecker condition picks the single term $x_i$ (Finset.sum_ite_eq); if $j' \neq j$ every term is $0$. Both match $\delta _{jj'}\, x_i$; with 2 and 3 this is the goal.

Theorem 16 ReLU Jacobian (guarded subgradient)

✓

assume:

$x$ is a smooth point of $\mathrm{ReLU}$: every coordinate $x_k \neq 0$ [h_smooth]

prove: for all $i, j$:

\[ \operatorname {pdiv}(\mathrm{ReLU})\, x\, i\, j = \delta _{ij} \cdot [x_i {\gt} 0], \]

where $[P]$ is the Iverson bracket ($1$ if $P$ holds, else $0$).

Proof ▶

Sketch: near a smooth point, ReLU is a fixed linear map (each coordinate is committed to its branch of the $\max $); compute that map’s derivative and transport it.

Define $\Lambda _x := \Pi _k\, (\text{if } x_k {\gt} 0 \text{ then } \mathrm{proj}_k \text{ else } 0)$, the diagonal indicator CLM (reluLinearPart).
Let $r := \min _k |x_k|$. Then $r {\gt} 0$, and $\mathrm{ReLU}$ agrees with $\Lambda _x$ on the ball $B(x, r)$.
proof: $r {\gt} 0$ by assumption 1. For $y \in B(x, r)$ and every $k$: $|y_k - x_k| \le \lVert y - x \rVert {\lt} r \le |x_k|$, so $y_k$ has the sign of $x_k$; both functions then return $y_k$ where $x_k {\gt} 0$ and $0$ where $x_k {\lt} 0$.
$\mathrm{ReLU}$ has Fréchet derivative $\Lambda _x$ at $x$.
proof: A continuous linear map is its own derivative; by 2 the two functions agree on a neighborhood of $x$, and HasFDerivAt.congr_of_eventuallyEq transports the derivative across that agreement.
q.e.d.
proof: By Definition 1 and 3, $\operatorname {pdiv}(\mathrm{ReLU})\, x\, i\, j = \Lambda _x(\mathbf{e}_i)_j = [x_j {\gt} 0] \cdot (\mathbf{e}_i)_j$; case split on $i = j$ (when they coincide, $[x_j {\gt} 0] = [x_i {\gt} 0]$) gives the goal.

Theorem 17 Softmax cross-entropy gradient

✓

Write $p := \mathrm{softmax}(z)$ and $\mathrm{CE}(z, \ell ) := -\log p_\ell $ (viewed as $\mathbb {R}^{1}$-valued so $\operatorname {pdiv}$ applies; we read off the only output coordinate). prove: for all $j$:

\[ \operatorname {pdiv}\bigl(\mathrm{CE}(\cdot , \ell )\bigr)\, z\, j\, 0 = p_j - \mathrm{onehot}(\ell )_j. \]

Proof ▶

Sketch: chain rule on $-\log \circ (z \mapsto p_\ell )$, with the softmax Jacobian (proved in Ch 9) supplying the inner derivative; the $1/p_\ell $ from $\log $ cancels the $p_\ell $ the Jacobian produces.

$p_\ell {\gt} 0$, in particular $p_\ell \neq 0$.
proof: $p_\ell $ is a positive exponential over a positive finite sum of exponentials.
$\operatorname {pdiv}\bigl(\mathrm{CE}(\cdot , \ell )\bigr)\, z\, j\, 0 = \operatorname {fderiv}_{\mathbb {R}}\, \bigl(z' \mapsto \mathrm{CE}(z', \ell )\bigr)\, z\, (\mathbf{e}_j)$.
proof: Definition 1; the $\mathbb {R}^{1}$ wrapper just evaluates the single output coordinate (fderiv_apply), legitimate because the wrapper is differentiable — $\mathrm{softmax}$ is differentiable and $\log $ is differentiable away from $0$, which 1 grants.
$\operatorname {fderiv}_{\mathbb {R}}\, \bigl(z' \mapsto \mathrm{CE}(z', \ell )\bigr)\, z = -\bigl(p_\ell ^{-1} \cdot \operatorname {fderiv}_{\mathbb {R}}\, (z' \mapsto \mathrm{softmax}(z')_\ell )\, z\bigr)$.
proof: $\mathrm{CE}(\cdot , \ell ) = -\log \circ (z' \mapsto \mathrm{softmax}(z')_\ell )$; HasFDerivAt.log with 1 differentiates the $\log $, then negate.
$\operatorname {fderiv}_{\mathbb {R}}\, (z' \mapsto \mathrm{softmax}(z')_\ell )\, z\, (\mathbf{e}_j) = \operatorname {pdiv}(\mathrm{softmax})\, z\, j\, \ell = p_\ell \, (\delta _{j\ell } - p_j)$.
proof: Definition 1, then the softmax Jacobian (Theorem 65).
q.e.d.
proof: Chain 2–4: $-p_\ell ^{-1} \cdot p_\ell \, (\delta _{j\ell } - p_j) = p_j - \delta _{j\ell }$, cancelling by 1; and $\mathrm{onehot}(\ell )_j = \delta _{j\ell }$ by definition.

Theorem 18 Dense VJP

✓

$\mathsf{HasVJP}\, \bigl(\mathrm{dense}(W, b)\bigr)$: the input-gradient backward of a dense layer is multiplication by $W$.

Proof ▶

Define $B(x, dy) := W\, dy$, i.e. $B(x, dy)_i = \sum _j W_{ij}\, dy_j$.
suffices: for all $x$, $dy$, $i$: $B(x, dy)_i = \sum _j \operatorname {pdiv}\bigl(\mathrm{dense}(W, b)\bigr)\, x\, i\, j \cdot dy_j$.
proof: Definition 9, with $B$ as the candidate backward function.
q.e.d.
proof: The Dense Jacobian (Theorem 14) gives $\operatorname {pdiv}\bigl(\mathrm{dense}(W, b)\bigr)\, x\, i\, j = W_{ij}$; substituting into 2 leaves exactly $\sum _j W_{ij}\, dy_j = B(x, dy)_i$.

Theorem 19 Dense weight gradient is the outer product

✓

$dW = x \otimes dy$. Phase 7 promoted from vacuous rfl to theorem: with $F$ and $\varphi $ as in Theorem 15, prove: for all $i, j$:

\[ (x \otimes dy)_{ij} = \sum _k \operatorname {pdiv}F\, (\mathrm{flatten}\, W)\, \varphi (i, j)\, k \cdot dy_k. \]

Proof ▶

Each summand: $\operatorname {pdiv}F\, (\mathrm{flatten}\, W)\, \varphi (i, j)\, k \cdot dy_k = (\text{if } k = j \text{ then } x_i \text{ else } 0) \cdot dy_k$.
proof: Dense Jacobian wrt weight (Theorem 15).
q.e.d.
proof: The sum in 1 collapses at $k = j$ (Finset.sum_eq_single) to $x_i \cdot dy_j$, which is $(x \otimes dy)_{ij}$ by definition of the outer product.

Theorem 20 Dense bias gradient is identity

✓

$db = dy$. Phase 7: derived, no new axiom. prove: for all $i$:

\[ dy_i = \sum _j \operatorname {pdiv}\bigl(b' \mapsto \mathrm{dense}(W, b')\, x\bigr)\, b\, i\, j \cdot dy_j. \]

Proof ▶

$\operatorname {pdiv}\bigl(b' \mapsto \mathrm{dense}(W, b')\, x\bigr)\, b\, i\, j = \delta _{ij}$.
proof: As a function of $b'$, the layer is $(\text{constant in } b') + b'$: sum rule (Theorem 3), constant rule (Theorem 6), and identity Jacobian (Theorem 5). (This is the Lean lemma pdiv_dense_b.)
q.e.d.
proof: Substitute 1: $\sum _j \delta _{ij}\, dy_j = dy_i$ (Finset.sum_eq_single).

Definition 21 ReLU VJP

✓

noncomputable def over the canonical pdiv-derived witness; HasVJP.correct holds by rfl since $\operatorname {pdiv}$ is a def over $\operatorname {fderiv}$. At non-smooth points the canonical backward is $\operatorname {fderiv}$’s junk default of $0$; the codegen substitutes the standard subgradient convention.

Definition 22 MLP composition VJP

✓

noncomputable def over the canonical pdiv-derived witness; same shape as relu_has_vjp. Codegen routes the ReLU subgradient at the kink.

2.2 Example: MNIST MLP

The theorems above are the calculus. Here is a concrete architecture built from those pieces: a three-layer fully-connected classifier for 28$\times $28 MNIST digits.

Dataset overview

MNIST is the standard testbed for image-recognition learning. It’s a collection of 28$\times $28 grayscale images of handwritten digits 0–9: 60 000 training images and 10 000 test images, each with a label indicating which digit was drawn. The dataset has been around since 1998 — at 784 pixels per image and 10 classes, MNIST is small enough that you can train a competitive model on a laptop CPU in minutes while still having a nontrivial learning problem.

Our goal in this chapter is to correctly classify a held-out test digit based on a model trained from the 60 000 training digits. We’re going to ignore the 2D spatial structure of the image entirely for now — just flatten each 28$\times $28 image into a 784-dim vector and treat it as a plain supervised-learning classification problem. This is the multilayer perceptron (MLP). Chapter 3 revisits MNIST with convolutions that respect the spatial structure.

Architecture

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.62cm, minimum width=4.0cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, dense/.style = {col, draw=orange!65!black, fill=orange!12}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, ] % Vertical layer column: input at top -> logits at bottom. \node[io] (input) {Input \;\; $28\times28$ flatten $\to 784$}; \node[dense, below=0.24cm of input] (d1) {\textbf{Dense} $784\to512$, ReLU}; \node[dense, below=0.24cm of d1] (d2) {\textbf{Dense} $512\to512$, ReLU}; \node[head, below=0.24cm of d2] (d3) {\textbf{Dense} $512\to10$ \;(identity)}; \node[logits, below=0.24cm of d3] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/d1, d1/d2, d2/d3, d3/out} \draw[arr] (\a) -- (\b); \end{tikzpicture}$

Three dense layers stacked with ReLU non-linearities between them: $784 \to 512 \to 512 \to 10$. First layer ingests the flattened image. The two hidden layers let the network learn nonlinear features. The final layer maps to 10-dimensional logits, one per digit class.

Code: the full training program

The entire training program in Lean is about twenty-five lines. This is the old-school SGD baseline — the exact config the Swift for TensorFlow book used in its Chapter 1 (SGD with learning rate 0.1, 12 epochs, no regularization tricks). In our repo this config is s4tfBaseline in MainAblation.lean:

-- 1
import LeanMlir

-- 2
def mnistMlp : NetSpec where
  name   := "MNIST-MLP"
  imageH := 28
  imageW := 28
  layers := [
    .dense 784 512 .relu,
    .dense 512 512 .relu,
    .dense 512  10 .identity
  ]

-- 3
def s4tfBaseline : TrainConfig where
  learningRate   := 0.1    -- old-school SGD learning rate
  batchSize      := 128
  epochs         := 12
  useAdam        := false  -- plain SGD, no momentum, no moment buffers
  weightDecay    := 0.0
  cosineDecay    := false
  warmupEpochs   := 0
  augment        := false
  labelSmoothing := 0.0

-- 4
def main (args : List String) : IO Unit :=
  mnistMlp.train s4tfBaseline (args.head?.getD "data") .mnist

Walking through the numbered sections:

1. The import. One line. LeanMlir is the framework implemented in the LeanMlir/ directory of the repo — it exposes NetSpec, TrainConfig, and the train method. Everything downstream in this chapter and the next is a specialization of this framework.

2. The architecture. NetSpec is a plain data structure: a name, input dimensions, and a list of Layer values. Our three layers are .dense 784 512 .relu, .dense 512 512 .relu, .dense 512 10 .identity. Read left-to-right: input size, output size, activation. That’s the entire model definition — no class, no forward-pass function, no @differentiable attribute. The forward pass is derivable from the layer list, and the backward pass is proved correct (see the theorems earlier in this chapter).

3. The training hyperparameters. Plain SGD at learning rate 0.1, batch size 128, 12 epochs. No Adam, no weight decay, no cosine schedule, no warmup, no augmentation, no label smoothing. This is the deliberately minimal “Chapter 1” baseline — the Swift for TensorFlow book shipped exactly this config as its reference example, and it’s a pedagogically useful starting point because every later chapter’s tweak — training-recipe (Adam, cosine, warmup, weight decay, augmentation, label smoothing) and architectural (BN, residuals) — then has a clean baseline to measure against. The repo’s MainAblation.lean ships a dozen other configs beside this one so you can run each ablation and see the delta.

4. The program entry point. One line. Call mnistMlp.train with the config, the data directory (default data/), and the dataset kind (.mnist). Everything inside .train (walked through in Chapter 1, “What’s inside .train?”) — data loading, mini-batching, forward-pass MLIR generation, IREE compilation, gradient-descent loop, evaluation loop, logging — is already formalized in the framework. The user-facing program is just “specify the network, specify the knobs, run.”

That’s the whole thing. The parts people usually have to write by hand every time (the training loop, the eval loop, the gradient- computation machinery) are hidden inside the framework because they were the same every time. The correctness of what’s inside .train is what the proof chapters prove.

Results

Build the ablation runner and invoke it with the mlp-sgd config. Output below is from a real run, captured verbatim from logs/ablation_mlp-sgd.log, on an AMD 7900 XTX (ROCm, gfx1100):

$ lake build ablation
$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/ablation mlp-sgd
Ablation: mlp-sgd
  spec: MNIST-MLP, optimizer: SGD
  lr: 0.100000, cosine: false, wd: 0.000000
  aug: false, label_smooth: 0.000000
MNIST-MLP-mlp-sgd: 669706 params
Generating train step MLIR...
  10375 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 60000 images (784 floats/image)
training: 468 batches/epoch, batch=128, SGD, lr=0.100000
  step 0/468: loss=2.437782 (34ms)
Epoch  1/12: loss=0.218235 lr=0.100000 (9417ms)
Epoch  2/12: loss=0.082146 lr=0.100000 (9269ms)
Epoch  3/12: loss=0.050954 lr=0.100000 (9280ms)
Epoch  4/12: loss=0.035614 lr=0.100000 (9171ms)
Epoch  5/12: loss=0.025503 lr=0.100000 (9119ms)
Epoch  6/12: loss=0.017866 lr=0.100000 (9118ms)
Epoch  7/12: loss=0.011806 lr=0.100000 (9153ms)
Epoch  8/12: loss=0.009618 lr=0.100000 (9169ms)
Epoch  9/12: loss=0.005895 lr=0.100000 (9250ms)
Epoch 10/12: loss=0.003566 lr=0.100000 (9262ms)
  val accuracy: 9795/9984 = 98.11%
Epoch 11/12: loss=0.002229 lr=0.100000 (9114ms)
Epoch 12/12: loss=0.000999 lr=0.100000 (9093ms)
  val accuracy: 9841/9984 = 98.57%
Saved params + BN stats.

Twelve epochs, about 9 seconds per epoch, $\sim $110 seconds total. Training loss drops from $2.44$ at step 0 (random-initialization cross-entropy on 10 classes: $\log 10 \approx 2.303$, plus a bit of noise) down below $10^{-3}$ by the end. Final test accuracy lands at 98.57%, which is the going rate for a three-layer MLP on MNIST with plain SGD. The matching CPU run via the Docker image in Appendix B takes about 5 minutes and lands at the same accuracy.

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Training loss}, ymode=log, xmin=0, xmax=13, ymin=0.0007, ymax=0.4, xtick={0,2,4,6,8,10,12}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,0.218235) (2,0.082146) (3,0.050954) (4,0.035614) (5,0.025503) (6,0.017866) (7,0.011806) (8,0.009618) (9,0.005895) (10,0.003566) (11,0.002229) (12,0.000999) }; \end{axis} \end{tikzpicture}$

MNIST MLP ($784{\to }512{\to }512{\to }10$), SGD 0.1, 12 epochs, log-scale loss (logs/ablation_mlp-sgd.log). Per-epoch training loss falls $\sim $200$\times $ from the first epoch to below $10^{-3}$; final test accuracy 98.57%.

Chapter 1 (MNIST 2D CNN) runs the same s4tfBaseline config against a CNN architecture for a direct apples-to-apples comparison. Later chapters switch to the modern recipe (Adam + cosine + warmup + weight decay + label smoothing + data augmentation) once we can benchmark the deltas against this baseline.

2.3 Return on width

The $512{\to }512$ hidden size is a convention, not a measurement. Because the verified renderer is parametric in the layer dimensions — the same mlp_has_vjp theorem covers every width, being polymorphic in $d_0,d_1,d_2,d_3$ — we can sweep the hidden width and train each point on its own proof-rendered StableHLO. The mnist-mlp-grid driver does exactly this: it renders $784{\to }d{\to }d{\to }10$ from the faithful emitter and trains it end to end. Sweeping $d$ over the powers of two from 8 to 4096 (each 12-epoch SGD, on the GPU) gives the accuracy-vs-width curve below.

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Hidden width $d$ (both hidden layers)}, ylabel={MNIST test accuracy (\%)}, xmode=log, log basis x=2, xmin=6.5, xmax=5000, ymin=90.5, ymax=98.7, xtick={8,16,32,64,128,256,512,1024,2048,4096}, xticklabels={8,16,32,64,128,256,512,1024,2048,4096}, ytick={91,92,93,94,95,96,97,98}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1.5pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (8,91.42) (16,94.47) (32,96.18) (64,97.19) (128,97.48) (256,97.72) (512,97.82) (1024,97.89) (2048,97.95) (4096,98.03) }; \addplot[only marks, mark=o, mark size=4pt, red, line width=1pt, forget plot] coordinates {(64,97.19)}; \node[anchor=west, font=\footnotesize, red!70!black] at (axis cs:74,95.5) {knee $\approx 64$: 97.2\% at 55K params}; \end{axis} \end{tikzpicture}$

Return on width for the verified MNIST MLP ($784{\to }d{\to }d{\to }10$): test accuracy versus hidden width $d$ (log scale), 12-epoch SGD on the proof-rendered StableHLO (runs/mlp_grid_results.tsv). The curve flattens hard after $\sim $64 neurons: widening $8{\to }64$ buys $+5.8$ points, but $64{\to }4096$ — $64\times $ the width, $364\times $ the parameters — buys only $+0.8$. A $32{\to }32$ MLP already reaches 96.2% at 26K parameters, $760\times $ smaller than the $4096{\to }4096$ net (98.0%). The canonical $512{\to }512$ sits comfortably on the plateau; anything past it is paying for the third decimal place. Off the diagonal there is a cliff worth noting: a $784{\to }8{\to }4096{\to }10$ net collapses to chance (9.8%) — a narrow layer feeding a wide one starves the wide layer of signal and it never trains under fixed-learning-rate SGD.

2.4 MLIR: Dense

What is already proven. mlpForward is the three-dense-layer network; mlp_has_vjp_at proves its reverse-mode derivative (the vector–Jacobian product) by chaining the per-layer chain rule vjp_comp_at; the parameter gradients are pinned by dense_weight_grad_correct and dense_bias_grad_correct, and the loss gradient by softmaxCE_grad. So what the gradient is, is not in question.

The gap and how we close it. The emitted backward is a value of a Lean datatype, Back, with a denotation $[\! [\cdot ]\! ]$ into the proofs’ own vector type. For the MLP the graph is

\[ \texttt{emitMlpBack} = \texttt{dotGeneral}\, W_0 \circ \texttt{selectPos}\, p_0 \circ \texttt{dotGeneral}\, W_1 \circ \texttt{selectPos}\, p_1 \circ \texttt{dotGeneral}\, W_2, \]

rooted at the incoming cotangent, with $[\! [\texttt{dotGeneral}\, W]\! ] = \texttt{Mat.mulVec}\, W$ and $[\! [\texttt{selectPos}\, p]\! ] = (v \mapsto \text{if } p{\gt}0 \text{ then } v \text{ else } 0)$, and the bridge theorem

\[ \texttt{mlp\_ whole\_ bridge}:\quad [\! [\texttt{emitMlpBack}]\! ](dy) = \texttt{mlp\_ has\_ vjp\_ at}.\mathrm{backward}(dy) \]

says the denotation of that graph is the proven derivative; the forward, the loss cotangent, and the parameter gradients are covered the same way. The printer walks the graph and emits one stablehlo op per node:

Here is exactly what the printer emits for that graph — nothing hand-written, one stablehlo op per IR node (default precision attributes elided for the page):

func.func @mlp_back(%dy: tensor<2x2xf32>, %W0: tensor<4x3xf32>,
    %W1: tensor<3x3xf32>, %W2: tensor<3x2xf32>,
    %p0: tensor<2x3xf32>, %p1: tensor<2x3xf32>) -> tensor<2x4xf32> {
  %bk0 = stablehlo.dot_general %dy, %W2, contracting_dims = [1] x [1]
           : (tensor<2x2xf32>, tensor<3x2xf32>) -> tensor<2x3xf32>
  %bk1 = stablehlo.constant dense<0.0> : tensor<2x3xf32>
  %bk2 = stablehlo.compare GT, %p1, %bk1
           : (tensor<2x3xf32>, tensor<2x3xf32>) -> tensor<2x3xi1>
  %bk3 = stablehlo.select %bk2, %bk0, %bk1
           : tensor<2x3xi1>, tensor<2x3xf32>
  %bk4 = stablehlo.dot_general %bk3, %W1, contracting_dims = [1] x [1]
           : (tensor<2x3xf32>, tensor<3x3xf32>) -> tensor<2x3xf32>
  %bk5 = stablehlo.constant dense<0.0> : tensor<2x3xf32>
  %bk6 = stablehlo.compare GT, %p0, %bk5
           : (tensor<2x3xf32>, tensor<2x3xf32>) -> tensor<2x3xi1>
  %bk7 = stablehlo.select %bk6, %bk4, %bk5
           : tensor<2x3xi1>, tensor<2x3xf32>
  %bk8 = stablehlo.dot_general %bk7, %W0, contracting_dims = [1] x [1]
           : (tensor<2x3xf32>, tensor<4x3xf32>) -> tensor<2x4xf32>
  return %bk8 : tensor<2x4xf32>
}

Read it against the graph: each dot_general (%bk0, %bk4, %bk8) is a dotGeneral $W$ node, whose denotation is $\texttt{Mat.mulVec}\, W$; each compare GT + select pair (%bk2/%bk3, %bk6/%bk7) is a selectPos $p$ node — the ReLU subgradient. The bridge theorem is precisely the claim that this text computes mlp_has_vjp_at’s backward.

Caveats.

ReLU is a smooth-point bridge. Where a pre-activation is exactly zero ReLU has no derivative; the emitted compare GT 0 sends that case to $0$, and the bridge is permitted to fail on that measure-zero set.
Representative scale. Shown small; the trained net is $784{\to }512{\to }512{\to }10$.
Trusted surface. The printer’s faithfulness, IREE’s lowering, and float32 $\approx \mathbb {R}$ are tested, not proved; the full accounting is in Appendix C.5.