3 MNIST: 2D CNN

The MLP wastes information; we can do better

Chapter 2’s MLP recognized digits by flattening every 28$\times $28 image into a 784-dim vector and feeding it through three dense layers. It worked. But it worked by ignoring something the data was telling us. Pixel $(5, 5)$ and pixel $(5, 6)$ are neighbors: they were next to each other on the page where the digit was drawn. Pixel $(5, 5)$ and pixel $(23, 14)$ are not. The dense layer has no idea. It treats every input coordinate as an independent feature; its weight matrix has no notion that some pairs of pixels are geometrically close. A trained MLP eventually learns to compensate, but it has to learn it from the data, with the data, by paying the full price for each unrelated feature pair.

A handwritten “5” drawn in the top-left of the image and the same “5” drawn in the bottom-right are the same digit. The MLP sees them as completely unrelated vectors. It can be trained to classify both, but each is its own learning problem; nothing it learned about one transfers to the other for free.

Convolutional networks fix this by building three priors directly into the architecture—priors that the data has but the MLP threw away.

Three properties baked into a convolution

Locality. Each output value depends on only a small neighborhood of inputs. A convolution with a $3 \times 3$ kernel computes each output pixel as a weighted sum of its 9 neighbors. Faraway pixels do not enter directly; the network only learns to combine them later, by stacking convolutions and letting receptive fields grow.
Translation invariance. The same weights are applied at every spatial position. A feature detector that recognizes the bottom-left curve of a “5” uses the same parameters whether the curve appears at row 5 or row 25. The architecture imposes the symmetry; the data does not have to teach it.
Parameter sharing. A $3 \times 3 \times 1 \to 32$ convolution has $3 \times 3 \times 1 \times 32 + 32 = 320$ parameters total, regardless of image size. The matching fully-connected layer on a 28$\times $28 image would have $28^2 \times 28^2 \times 32 \approx 20$ million. The convolution gets the same expressive power with four orders of magnitude fewer parameters because it reuses them.

Concretely, a convolution computes

\[ y_{c_{\text{out}},\, i,\, j} \; =\; \sum _{c_{\text{in}},\, k_h,\, k_w}\, W_{c_{\text{out}},\, c_{\text{in}},\, k_h,\, k_w} \cdot x_{c_{\text{in}},\, i + k_h,\, j + k_w} \; +\; b_{c_{\text{out}}}. \]

Each output cell is a small dot product between the kernel $W$ and a window of the input. The sum is over a tiny $3 \times 3$ spatial region and over input channels—usually a handful of terms, not the 784 a dense layer touches.

The Jacobian is sparse and structured

The Jacobian of a convolution looks like a giant matrix, but it isn’t a generic giant matrix—it’s mostly zeros. Specifically, $\partial y_{c_{\text{out}}, i, j} / \partial x_{c_{\text{in}}, i', j'}$ is zero unless $(i', j')$ falls inside the kernel window centered at $(i, j)$. When it does, the entry is just the corresponding weight $W_{c_{\text{out}}, c_{\text{in}}, k_h, k_w}$.

So a convolution is a linear map—like a dense layer—but its Jacobian is highly structured: each row has only $k^2 \times C_{\text{in}}$ nonzero entries, and those entries are reused across many rows because the same kernel is used at every spatial position. The forward picture from Chapter 2 “the Jacobian of a linear map is the linear map” still holds. The Jacobian is just written in a different basis: instead of one big dense matrix it is a list of kernel weights plus a rule for where each one lives.

The backward of a convolution is another convolution

We want $dx$—how a small jiggle in each input pixel changes the loss—given an upstream $dy$. Plugging into the Jacobian above and grinding through the algebra, the answer is:

\[ dx_{c_{\text{in}}, i', j'} \; =\; \sum _{c_{\text{out}},\, k_h,\, k_w}\, W_{c_{\text{out}},\, c_{\text{in}},\, k_h,\, k_w} \cdot dy_{c_{\text{out}},\, i' - k_h,\, j' - k_w}. \]

That is itself a convolution. Specifically, it is a convolution of the upstream gradient $dy$ against the same kernel $W$, but with the spatial axes reversed and the input/output channel axes swapped—a “transposed convolution”. The backward of a CNN block reuses the same kind of operation as the forward; no new primitive is needed.

The weight gradient $dW$ similarly works out to a convolution, this time of the saved input $x$ against the upstream gradient $dy$ (the “transpose trick”). The bias gradient $db$ is just $dy$ summed over spatial positions per channel. Theorems 25, 26, and 27 formalize each of these.

Max pooling: the winner gets the gradient

The other new primitive in this chapter is max pooling. A $2 \times 2$ max pool collapses each $2 \times 2$ block of the input into a single output value: the max of the four. It has no parameters; its purpose is to halve the spatial resolution while keeping the strongest local response.

The Jacobian of $\max $ is degenerate. At a generic point, only one of the four inputs is the maximum, and the output’s slope is $1$ with respect to that one input and $0$ with respect to the other three. “Jiggle the max” moves the output; “jiggle a non-max” does nothing. So the backward of max pool routes the upstream gradient back to whichever input was the winner in each window, and zeros out the rest. At ties the codegen breaks deterministically (same convention as ReLU’s kink). Definition 29 states this formally.

The rest of the network is Chapter 2

After two convolutions and a max pool, the feature map is $14 \times 14 \times 32$. At that point we flatten the $14 \cdot 14 \cdot 32 = 6272$-dim feature vector and feed it into the same three-dense-layer head from Chapter 2, ending in 10-class logits. The backward through the dense head and the softmax-CE loss is exactly the calculation we did in the previous chapter—we do not re-prove it here. The composition rule from Chapter 1 guarantees that the full network’s VJP is just the per-layer VJPs strung together in reverse order: softmax-CE backward, dense backward $\times 3$, flatten backward, max-pool backward, conv2d backward $\times 2$. Each piece is its own theorem; together they form a working trainer for MNIST that respects the image’s 2-D structure.

3.1 The theorems

Definition 23 Conv2d forward

✓

Concrete $\sum _{c,kh,kw}$ cross-correlation with SAME padding (Phase 7). Codegen emits stablehlo.convolution; proofs reason about it via the explicit definition.

Definition 24 3D VJP record

✓

The rank-3 analogue of Definition 9. For $f : \mathrm{Tensor3} \to \mathrm{Tensor3}$, define $\operatorname {pdiv}_3 f\, x\, (c_i, h_i, w_i)\, (c_o, h_o, w_o)$ as $\operatorname {pdiv}$ of the flattened map $v \mapsto \mathrm{flatten}\bigl(f(\mathrm{unflatten}\, v)\bigr)$ with both indices decoded through the $(c, h, w) \leftrightarrow \text{flat}$ bijection. Then $\mathsf{HasVJP3}\, f$ bundles a backward function $B$ with: for all $x$, $dy$, $(c_i, h_i, w_i)$,

\[ B(x, dy)_{c_i h_i w_i} = \sum _{c_o, h_o, w_o} \operatorname {pdiv}_3 f\, x\, (c_i, h_i, w_i)\, (c_o, h_o, w_o) \cdot dy_{c_o h_o w_o}. \]

Theorem 25 Conv2d input VJP

✓

$\mathsf{HasVJP3}\, \bigl(\mathrm{conv2d}(W, b)\bigr)$: the input backward of a convolution is the reversed-kernel convolution. Phase 1 (Apr 2026). No hypotheses: conv2d is affine in its input.

Proof ▶

Sketch: same split–distribute–product-rule–collapse skeleton as the dense Jacobians (Theorems 14, 15), with two convolution-specific twists: the finite-sum rule fires three times (over $c, k_h, k_w$), and the product rule’s variable factor is a pad-conditional projection — a case split on the SAME-padding condition decides between a coordinate projection and the zero function.

Define $B(x, dy)_{c_i h_i w_i} := \sum _{c_o, h_o, w_o} [\text{pad-valid}]\; W_{c_o, c_i, \hat{k}_h, \hat{k}_w} \cdot dy_{c_o h_o w_o}$, where $\hat{k}_h = h_i + p_H - h_o$, $\hat{k}_w = w_i + p_W - w_o$ are the reconstructed kernel offsets and $[\text{pad-valid}]$ requires them to lie in $[0, k_H) \times [0, k_W)$. Under the substitution $k_h \leftrightarrow k_H - 1 - k_h$ this is the reversed-kernel, channel-swapped convolution of $dy$ against $W$ that the MLIR backward emits (transpose + reverse + convolution).
suffices: for all $x$, $dy$, $(c_i, h_i, w_i)$: $B(x, dy)_{c_i h_i w_i}$ equals the $\operatorname {pdiv}_3$ triple sum.
proof: Definition 24, with $B$ as the candidate backward function.
The flattened forward $v \mapsto \mathrm{flatten}\bigl(\mathrm{conv2d}(W, b)\, (\mathrm{unflatten}\, v)\bigr)$ is affine in $v$: a constant bias term plus $\sum _{c, k_h, k_w} W_{o(\mathrm{idx}), c, k_h, k_w} \cdot (\text{pad-conditional projection of } v)$.
proof: Unfold $\mathrm{conv2d}$ and $\mathrm{flatten}$; definitional (Definition 23).
Per-index Jacobian: for every flat $\mathrm{idx_{in}}$, $\mathrm{idx_{out}}$, $\operatorname {pdiv}$(flattened forward) $= \sum _{c, k_h, k_w} W_{o(\mathrm{idx_{out}}), c, k_h, k_w} \cdot [\text{pad-valid} \wedge \mathrm{idx_{in}} = \text{encoded input position}]$.
proof: By 3: sum rule + constant rule (Theorems 3, 6) drop the bias; the finite-sum rule (Theorem 8) fires three times over $(c, k_h, k_w)$; per summand the product rule (Theorem 4) applies with the $W$-factor constant, and a case split on the pad condition leaves either a coordinate projection — a reindex, whose Jacobian is the Kronecker indicator (Theorem 7) — or the zero function (Theorem 6).
Contract 4 with $dy$ and collapse: re-index the sum over flat $\mathrm{idx_{out}}$ to $(c_o, h_o, w_o)$; for each fixed output position the triple sum over $(c, k_h, k_w)$ collapses by three Finset.sum_eq_single applications at $c = c_i$, $k_h = \hat{k}_h$, $k_w = \hat{k}_w$.
proof: The indicator in 4 is nonzero for exactly that one summand.
q.e.d.
proof: What remains after 5 is exactly $B(x, dy)_{c_i h_i w_i}$ from 1; with 2 this is the goal.

Theorem 26 Conv2d weight VJP

✓

Phase 7: the transpose-trick formula. As with the dense weight Jacobian (Theorem 15), differentiate in the flattened kernel: let $F(v) := \mathrm{flatten}\bigl(\mathrm{conv2d} (\mathrm{unflatten}\, v,\, b)\, x\bigr)$ for $v \in \mathbb {R}^{o_c \cdot i_c \cdot k_H \cdot k_W}$. prove: $\mathsf{HasVJP}\, F$, with backward $dW_{o, c, k_h, k_w} = \sum _{h_i, w_i} [\text{pad-valid}]\; x_{c,\, h_i + k_h - p_H,\, w_i + k_w - p_W} \cdot dy_{o, h_i, w_i}$ — a convolution of the saved input against the upstream gradient.

Proof ▶

Sketch: the dual of Theorem 25 — the same affine decomposition, but now $v$ (the kernel) sits in the reindex factor and the saved input $x$ is the constant.

Define $B(v, dy)_{\mathrm{idx}}$, decoding $\mathrm{idx} \leftrightarrow (o', c', k_h', k_w')$, as $\sum _{h_i, w_i} (\text{pad-conditional } x \text{ value at } (c', h_i + k_h' - p_H, w_i + k_w' - p_W)) \cdot dy_{\mathrm{flat}(o', h_i, w_i)}$.
suffices: for all $v$, $dy$, $\mathrm{idx}$: $B(v, dy)_{\mathrm{idx}} = \sum _{\mathrm{idx_{out}}} \operatorname {pdiv}F\, v\, \mathrm{idx}\, \mathrm{idx_{out}} \cdot dy_{\mathrm{idx_{out}}}$.
proof: Definition 9, with $B$ as the candidate backward function.
$F$ is affine in $v$: constant bias plus $\sum _{c, k_h, k_w} v_{\varphi (o(\mathrm{idx_{out}}), c, k_h, k_w)} \cdot (\text{pad-conditional } x \text{ term})$, where $\varphi $ is the kernel-index bijection (Kernel4.flatten).
proof: Unfold $\mathrm{conv2d}$ and the flattens; definitional (Definition 23).
Per-index Jacobian: $\operatorname {pdiv}F\, v\, \mathrm{idx}\, \mathrm{idx_{out}} = \sum _{c, k_h, k_w} [\mathrm{idx} = \varphi (o(\mathrm{idx_{out}}), c, k_h, k_w)] \cdot (\text{pad-conditional } x \text{ term})$.
proof: By 3: sum rule + constant rule (Theorems 3, 6) drop the bias; finite-sum rule (Theorem 8) $\times 3$; product rule (Theorem 4) with the $x$-factor constant and the $v$-factor a reindex through $\varphi $, whose Jacobian is the Kronecker indicator (Theorem 7).
Contract 4 with $dy$ and collapse: the $(c, k_h, k_w)$ sum collapses on $\mathrm{idx}$’s decode $(c', k_h', k_w')$ by injectivity of $\varphi $ (Finset.sum_eq_single $\times 3$); re-index $\mathrm{idx_{out}}$ to $(o, h_i, w_i)$ and collapse the channel sum at $o = o'$.
proof: The indicator forces $o = o'$; the spatial sums survive.
q.e.d.
proof: What remains after 5 is exactly $B(v, dy)_{\mathrm{idx}}$ from 1; with 2 this is the goal.

Theorem 27 Conv2d bias VJP

✓

Phase 9: sum the cotangent over spatial dims per channel. prove: $\mathsf{HasVJP}\, \bigl(b \mapsto \mathrm{flatten} (\mathrm{conv2d}(W, b)\, x)\bigr)$, with backward $db_o = \sum _{h_i, w_i} dy_{\mathrm{flat}(o, h_i, w_i)}$.

Proof ▶

Define $B(b, dy)_o := \sum _{h_i, w_i} dy_{\mathrm{flat}(o, h_i, w_i)}$.
suffices: for all $b$, $dy$, $o$: $B(b, dy)_o = \sum _{\mathrm{idx}} \operatorname {pdiv}\bigl(b' \mapsto \mathrm{flatten}(\mathrm{conv2d}(W, b')\, x)\bigr)\, b\, o\, \mathrm{idx} \cdot dy_{\mathrm{idx}}$.
proof: Definition 9, with $B$ as the candidate backward function.
As a function of $b$, the flattened forward is $(\text{channel reindex of } b) + (\text{$W\! ,x$ term constant in } b)$, so $\operatorname {pdiv}= \delta _{o,\, \mathrm{chan}(\mathrm{idx})}$.
proof: Sum rule (Theorem 3), reindex Jacobian (Theorem 7) with $\sigma = \mathrm{idx} \mapsto \mathrm{chan}(\mathrm{idx})$, constant rule (Theorem 6).
q.e.d.
proof: Contract 3 with $dy$: re-index the flat sum to $(c, h_i, w_i)$; the Kronecker keeps exactly channel $o$’s spatial positions (Finset.sum_eq_single), leaving $B(b, dy)_o$; with 2 this is the goal.

Definition 28 MaxPool 2x2 stride 2 forward

✓

Concrete four-way max over 2x2 windows (Phase 7).

Definition 29 MaxPool input VJP

✓

noncomputable def over the canonical pdiv-derived witness. HasVJP.correct holds by rfl; codegen substitutes the standard argmax routing convention at tiebreaks.

3.2 Example: MNIST 2D CNN

The MLP in Chapter 2 crushed MNIST by flattening the image into a 784-dim vector and throwing dense layers at it. That works because MNIST is small, but it also throws away all the spatial structure: pixel (5,5) and pixel (5,6) are neighbors, but the MLP sees no more relationship between them than it does between pixel (5,5) and pixel (23,14). Convolutions fix that. This chapter does MNIST again, the same way the Swift for TensorFlow book’s Chapter 1 did it: two plain $3 \times 3$ convs to pull out local features, a max-pool to collapse the spatial grid, then the same dense head you already know.

Architecture

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.62cm, minimum width=4.0cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, conv/.style = {col, draw=orange!65!black, fill=orange!12}, pool/.style = {col, draw=teal!60!black, fill=teal!10}, flat/.style = {col, draw=purple!60!black, fill=purple!8}, dense/.style = {col, draw=orange!65!black, fill=orange!12}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, ] % Vertical layer column: input at top -> logits at bottom. \node[io] (input) {Input \;\; $28\times28\times1$}; \node[conv, below=0.24cm of input] (c1) {\textbf{Conv2D} $1\to32$, $3\times3$, ReLU}; \node[conv, below=0.24cm of c1] (c2) {\textbf{Conv2D} $32\to32$, $3\times3$, ReLU}; \node[pool, below=0.24cm of c2] (p1) {\textbf{maxPool} $2\times2$ \;\; $28\to14$}; \node[flat, below=0.24cm of p1] (fl) {flatten \;\; $14\times14\times32 \to 6272$}; \node[dense, below=0.24cm of fl] (d1) {\textbf{Dense} $6272\to512$, ReLU}; \node[dense, below=0.24cm of d1] (d2) {\textbf{Dense} $512\to512$, ReLU}; \node[head, below=0.24cm of d2] (d3) {\textbf{Dense} $512\to10$ \;(identity)}; \node[logits, below=0.24cm of d3] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/c1, c1/c2, c2/p1, p1/fl, fl/d1, d1/d2, d2/d3, d3/out} \draw[arr] (\a) -- (\b); \end{tikzpicture}$

Two convolutions lift the 1-channel grayscale input to 32 channels, maxPool halves the spatial resolution, then the flattened $14 \times 14 \times 32 = 6272$ feature vector goes through three dense layers into 10-class logits. No batch norm yet — BN shows up in Chapter 4.

Code: the full training program

Same format as the MLP chapter. The only thing that changes from Chapter 2’s training program is the NetSpec — hyperparameters stay exactly the same s4tfBaseline config.

-- 1
import LeanMlir

-- 2
def mnistCnnNoBn : NetSpec where
  name   := "MNIST-CNN-noBN"
  imageH := 28
  imageW := 28
  layers := [
    .conv2d 1 32 3 .same .relu,
    .conv2d 32 32 3 .same .relu,
    .maxPool 2 2,
    .flatten,
    .dense 6272 512 .relu,
    .dense 512 512 .relu,
    .dense 512  10 .identity
  ]

-- 3 (identical to the MLP chapter's s4tfBaseline)
def s4tfBaseline : TrainConfig where
  learningRate   := 0.1
  batchSize      := 128
  epochs         := 12
  useAdam        := false
  weightDecay    := 0.0
  cosineDecay    := false
  warmupEpochs   := 0
  augment        := false
  labelSmoothing := 0.0

-- 4
def main (args : List String) : IO Unit :=
  mnistCnnNoBn.train s4tfBaseline (args.head?.getD "data") .mnist

Walking through the numbered sections:

1. The import. Same as Chapter 2.

2. The architecture. Seven layers: two .conv2d layers at 32 channels, one .maxPool, .flatten, three .dense layers down to 10. Each .conv2d takes in-channels, out-channels, kernel size, padding, activation — five parameters, no classes. Same data-first style as the MLP spec.

The .conv2d backward pass is the theorem from earlier in this chapter (§ 25): the input VJP is another convolution, using the kernel reversed along both spatial axes. The max-pool backward (§ 29) routes the gradient to the argmax position in each $2 \times 2$ window. The dense backward is already proved in Chapter 2 (§ 18). Adding convolution to the model introduces no new training-loop code — only new layer types, which the framework already knows how to compose.

3. The training hyperparameters. Literally the same s4tfBaseline config as the MLP chapter — SGD 0.1, 12 epochs, no regularization, no augmentation. That’s the framework’s pitch made concrete: swap the architecture, keep the trainer. The ablation framework gives you access to the same variant configs (adamOnly, fullRecipe, etc.) without any more changes on your side.

4. The program entry point. One line, same as the MLP chapter. mnistCnnNoBn.train s4tfBaseline replaces mnistMlp.train s4tfBaseline. That’s the whole diff.

Results

Build and run via the ablation framework:

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/ablation cnn-nobn-sgd
Ablation: cnn-nobn-sgd
  spec: MNIST-CNN-noBN, optimizer: SGD
  lr: 0.100000, cosine: false, wd: 0.000000
  aug: false, label_smooth: 0.000000
MNIST-CNN-noBN-cnn-nobn-sgd: 3489130 params
Generating train step MLIR...
  21884 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 60000 images (784 floats/image)
training: 468 batches/epoch, batch=128, SGD, lr=0.100000
  step 0/468: loss=2.436... (75ms)
Epoch  1/12: loss=0.196587 lr=0.100000 (39327ms)
Epoch  2/12: loss=0.043851 lr=0.100000 (39243ms)
...
Epoch 12/12: loss=(small)   lr=0.100000 (~39000ms)
  val accuracy: 9898/9984 = 98.98%

(Full log in logs/ablation_cnn-nobn-sgd.log.) Twelve epochs, about 40 seconds per epoch on the 7900 XTX, $\sim $8 minutes total. Final test accuracy: 98.98%, up from the MLP’s 98.57% on the same data, same optimizer, same number of epochs. The delta is 0.4 percentage points — not huge, but consistent and free: the convolution just finds spatial features the MLP couldn’t see.

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Training loss}, ymode=log, xmin=0, xmax=13, ymin=0.0007, ymax=0.4, xtick={0,2,4,6,8,10,12}, legend pos=north east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,0.218235) (2,0.082146) (3,0.050954) (4,0.035614) (5,0.025503) (6,0.017866) (7,0.011806) (8,0.009618) (9,0.005895) (10,0.003566) (11,0.002229) (12,0.000999) }; \addlegendentry{MLP} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (1,0.196587) (2,0.043851) (3,0.025374) (4,0.014667) (5,0.009847) (6,0.008109) (7,0.008859) (8,0.005455) (9,0.008056) (10,0.005157) (11,0.004191) (12,0.001768) }; \addlegendentry{CNN} \end{axis} \end{tikzpicture}$

MNIST, same SGD 0.1 / 12-epoch recipe, log-scale training loss (logs/ablation_{mlp,cnn-nobn}-sgd.log). The CNN converges faster early (half the MLP’s loss by epoch 2); both bottom out near $10^{-3}$ train loss, but the CNN generalizes better — 98.98% vs 98.57% test.

Why this network is still 99% dense-head

At 3,489,130 total parameters, the no-BN CNN is actually $\sim $5.2$\times $ bigger than the MLP, not smaller. You might have expected the opposite — convolutions are supposed to be parameter-efficient. They are, but the head is what’s eating the budget: the flatten of a $14 \times 14 \times 32$ feature map produces a 6272-dim vector, and the first dense layer (6272 $\to $ 512) alone is 3,211,776 parameters. The entire conv+maxPool backbone is 9,568 parameters — 0.27% of the total.

This is the same “fat FC head” pattern you’ll see in AlexNet, VGG, and YOLOv1: a compact convolutional backbone feeding a massive fully-connected classifier. Historically it drove practitioners crazy — the conv layers were the interesting new machinery, but all the weight budget lived in the part nobody had done anything new with. Every post-2015 vision architecture responded by replacing the dense head with globalAvgPool followed by a single small dense, which is why ResNet-18 has $\sim $11M parameters for 1000-class ImageNet but Chapter 3’s 10-class MNIST CNN has 3.5M. We’ll see the switch firsthand starting with ResNet in Chapter 5.

The MLIR character count also tells a story: the CNN emits 21,884 chars vs the MLP’s 10,375 — about 2.1$\times $ bigger. Adding convolutions and maxPool roughly doubles the emitted compute, not 3.5$\times $ the way our other (BN-equipped) CNN does. Every .convBn adds batch-norm normalize and affine ops; the plain .conv2d used here keeps the StableHLO much tighter.

The fat head is fat because it can be, not because it must be

If the dense head holds 99% of the parameters, the obvious question is how many of them earn their keep. The parametric renderer answers it directly: mnist-cnn-grid holds the convolutional backbone fixed at 32 channels and sweeps only the dense head ($\ldots {\to }d{\to }d{\to }10$), rendering each width from the faithful CNN emitter. Overlaying that sweep on the MLP’s (both trained on their own verified StableHLO) shows how little the head needs once good convolutional features exist:

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.8cm, xlabel={Hidden width $d$ (dense head / MLP hidden layers)}, ylabel={MNIST test accuracy (\%)}, xmode=log, log basis x=2, xmin=6.5, xmax=5000, ymin=90.5, ymax=99.6, xtick={8,16,32,64,128,256,512,1024,2048,4096}, xticklabels={8,16,32,64,128,256,512,1024,2048,4096}, ytick={91,92,93,94,95,96,97,98,99}, legend pos=south east, legend cell align={left}, legend style={font=\footnotesize, draw=none, fill=none}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1.5pt}, ] \addplot[green!55!black, mark=*, mark options={fill=green!55!black}] coordinates { (8,96.16) (16,98.30) (32,98.42) (64,98.86) (128,98.60) (256,98.87) (512,98.95) (1024,99.04) (2048,99.04) (4096,98.96) }; \addlegendentry{CNN (conv@32 fixed, head $d{\to}d$)} \addplot[blue, mark=square*, mark options={fill=blue}] coordinates { (8,91.42) (16,94.47) (32,96.18) (64,97.19) (128,97.48) (256,97.72) (512,97.82) (1024,97.89) (2048,97.95) (4096,98.03) }; \addlegendentry{MLP ($784{\to}d{\to}d{\to}10$)} \end{axis} \end{tikzpicture}$

Classifier-head width sweep for the verified CNN (convolutional backbone held at 32 channels), overlaid on the MLP width sweep of Chapter 2 (runs/cnn_grid_results.tsv). The whole CNN curve sits about a point above the MLP’s, and its knee arrives roughly $4\times $ earlier: a 16-wide head already reaches 98.3% at 110K parameters, beating the widest 4096-wide MLP. So the 3.2M-parameter $6272{\to }512$ layer is not buying accuracy — shrinking the head to $\ldots {\to }16{\to }16{\to }10$ keeps essentially all of it. The head is fat because the flatten hands it a 6272-vector and nobody trimmed it, not because the task needs the width; the only real drop is at $d=8$ (96.2%), where the $6272{\to }8$ bottleneck finally throws away features the convolutions found. (On this GPU the wall-clock is convolution-bound, so the shrink is a parameter win, not a speed win — which is exactly why the post-2015 fix was globalAvgPool, replacing the flatten rather than narrowing it.)

3.3 MLIR: Convolution

What is already proven. cnnForward is the conv $\to $ relu $\to $ conv $\to $ relu $\to $ maxPool $\to $ flatten $\to $ dense network; cnn_has_vjp_at proves its reverse-mode derivative by chaining the per-layer VJPs — the dense backward (§ 18, reused from Chapter 2), the max-pool backward (maxPool2_has_vjp3, § 29), and the convolution backward (conv2d_has_vjp3, § 25) — through the IR-level chain rule vjp_comp_at. The convolution’s reverse-mode derivative is the chapter’s headline result: the input VJP of a conv2d is another conv2d, against the kernel with its in- and out-channels swapped and both spatial axes reversed. The parameter gradients are pinned by conv2d_weight_grad_has_vjp (the transpose trick) and the dense grads from Chapter 2, and the loss gradient by softmaxCE_grad (§ 17). So what the convolution’s gradient is, is not in question.

The gap and how we close it. The emitted backward convolution is given a denotation $[\! [\cdot ]\! ]$ valued in the proofs’ own tensor type,

\[ [\! [\texttt{convBackDenote}\, W]\! ] = \texttt{conv2d}\, (\texttt{reverseSwap}\, W), \qquad (\texttt{reverseSwap}\, W)_{c_i\, c_o\, k_h\, k_w} = W_{c_o\, c_i\, \bar k_h\, \bar k_w}, \]

where $\bar k = K - 1 - k$ is the spatial flip, and the bridge theorem

\[ \texttt{conv\_ back\_ bridge\_ 1to2}:\quad [\! [\texttt{convBackDenote}\, W]\! ](dy) = (\texttt{conv2d\_ has\_ vjp3}\, W\, b).\mathrm{backward}(x, dy) \]

says the denotation of that graph is the proven conv input-VJP — the reversed-kernel identity cnn_has_vjp_at only asserts, discharged here by expansion at the concrete shape (and again at the $2{\to }2$ shape, by conv_back_bridge_2to2). The printer walks the graph and emits one stablehlo op per node — transpose for the channel swap, reverse for the spatial flip, convolution for the contraction:

Here is exactly what the printer emits for the first conv’s backward ($1{\to }2$ channels, $3\times 3$ kernel, at $4\times 4$):

func.func @conv_back(%dy: tensor<1x2x4x4xf32>,
    %W: tensor<2x1x3x3xf32>) -> tensor<1x1x4x4xf32> {
  %Wt = stablehlo.transpose %W, dims = [1, 0, 2, 3]
          : (tensor<2x1x3x3xf32>) -> tensor<1x2x3x3xf32>
  %Wr = stablehlo.reverse %Wt, dims = [2, 3] : tensor<1x2x3x3xf32>
  %dx = stablehlo.convolution(%dy, %Wr)
      dim_numbers = [b, f, 0, 1]x[o, i, 0, 1]->[b, f, 0, 1],
      window = {stride = [1, 1], pad = [[1, 1], [1, 1]],
                lhs_dilate = [1, 1], rhs_dilate = [1, 1]}
      : (tensor<1x2x4x4xf32>, tensor<1x2x3x3xf32>)
          -> tensor<1x1x4x4xf32>
  return %dx : tensor<1x1x4x4xf32>
}

Read it against the graph: %Wt is the transpose dims [1,0,2,3] that swaps the kernel’s out- and in-channel axes; %Wr is the reverse dims [2,3] that flips both spatial axes ($\bar k_h,\bar k_w$); together they build reverseSwap $W$. The convolution then convolves the incoming cotangent %dy against that kernel — exactly $\texttt{conv2d}\, (\texttt{reverseSwap}\, W)$, the $1\, \text{channel} \leftarrow 2\, \text{channels}$ adjoint of the forward conv. The bridge theorem is precisely the claim that this text computes conv2d_has_vjp3’s backward. The max-pool, flatten, dense, and relu nodes of the full backward are bridged the same way — the max-pool tile-compare-select by maxpool_back_bridge, the dense dot_generals exactly as in Chapter 2 — so the whole CNN backward is covered, op by op.

Caveats.

The convolution bridge is unconditional — a convolution is linear, so it holds at every input.
ReLU is a smooth-point bridge (no pre-activation exactly zero), failing only on that measure-zero set.
Max-pool needs a unique argmax — the bridge holds only where every $2\times 2$ window has a strict argmax; the tie set is the codegen trust boundary for pooling.
Representative scale — $1{\to }2$ channels at $4\times 4$ (production: $1{\to }32$ at $28\times 28$).