4 CIFAR with BatchNorm

This chapter is the bridge between the MNIST chapters and the deep-network chapters that follow. Chapter 3 did MNIST with a two-convolution net and a 2$\times $512 dense head; Chapter 5 (ResNet-34) goes thirty-four layers deep. The network here sits in between: it keeps MNIST’s 2$\times $512 head and the same conv/ReLU/max-pool machinery, but stacks the convolutions four stages deep (eight in all) on a harder dataset (CIFAR-10). This is the first point in the book where depth is enough to make two things bite: BatchNorm starts to earn its keep, and the choice of optimizer starts to matter. The depth itself is what ResNet then scales to thirty-four layers. So the chapter has two jobs: prove the one new operator (BatchNorm) correct, and use the deeper net to measure what actually governs training (§4.2).

Proving BN is also what makes this structurally the hardest chapter in the book. BN’s inverse-stddev term $1/\sqrt{\sigma ^2 + \varepsilon }$ has a gradient that would blow up as $\sigma ^2 \to 0$ if $\varepsilon $ were zero; proving that the backward pass exists and stays bounded for $\varepsilon {\gt} 0$ requires ContinuousLinearMap-based real-analysis machinery from Mathlib that none of the previous chapters needed. If you’re new to formal math, skim the proofs and trust them. The takeaways are concrete:

BN’s gradient has a closed-form 3-term formula (Theorem 34).
The formula needs $\varepsilon {\gt} 0$ to stay bounded (Theorem 32).
BN’s payoff is speed: a deeper network reaches the same accuracy in far fewer epochs with it than without. The example in §4.2 measures that directly.

The proofs themselves use Mathlib’s HasFDerivAt.sqrt, (hasDerivAt_inv).comp_hasFDerivAt, and a centering CLM chained through the chain rule from Ch 1. They’re correct (the Lean kernel checks them) and they’re available in Proofs/BatchNorm.lean for the curious; you do not need to understand them line-by-line to use BN as a layer or to follow the rest of the book. Ch 5 (ResNet-34) is the easiest chapter in the book and follows immediately — this is a localized difficulty spike, not the new normal.

What BN actually is

BatchNorm (Ioffe & Szegedy, 2015) takes a batch of activations, $x$, and does three things in sequence. Each is one line of code; together they are the layer.

Center. Compute the batch mean $\mu = \frac{1}{n} \sum _k x_k$ and subtract it from every sample: $x - \mu $. Output has mean zero.
Normalize. Compute the batch variance $\sigma ^2$, add a small $\varepsilon $ for numerical safety, and divide: $\hat{x} = (x - \mu )/\sqrt{\sigma ^2 + \varepsilon }$. Output has unit variance.
Affine. Scale and shift with learnable per-channel parameters $\gamma $ and $\beta $: $\mathrm{bn}(x) = \gamma \hat{x} + \beta $.
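The three steps above can be sketched in a few lines of plain Python over a single channel. This is an illustration only, not the book's Lean definitions: the function name `bn_forward` and the default `eps` are assumptions.

```python
import math

def bn_forward(x, gamma, beta, eps=1e-5):
    """BatchNorm over one channel: center, normalize, then scale and shift."""
    n = len(x)
    mu = sum(x) / n                          # batch mean
    centered = [xk - mu for xk in x]         # step 1: center
    var = sum(c * c for c in centered) / n   # biased batch variance
    istd = 1.0 / math.sqrt(var + eps)        # inverse standard deviation
    xhat = [c * istd for c in centered]      # step 2: normalize
    y = [gamma * v + beta for v in xhat]     # step 3: affine
    return y, xhat, istd

x = [1.0, 2.0, 4.0, 7.0]
y, xhat, istd = bn_forward(x, gamma=2.0, beta=0.5)
mean_xhat = sum(xhat) / len(xhat)
var_xhat = sum(v * v for v in xhat) / len(xhat)
print(mean_xhat, var_xhat)
```

The normalized values have mean zero and variance $\sigma ^2/(\sigma ^2 + \varepsilon )$, which sits marginally below one for any $\varepsilon {\gt} 0$; after the affine step the batch mean of the output is $\beta $.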

The chapter’s first six theorems are the Jacobian of each step and the VJPs that fall out of them. Theorem 36 closes the loop by composing them. The forward is three steps, so the chain rule from Chapter 1 gives us three Jacobians to multiply, and the “BN three-term backward” is exactly that product written out.

The centering term has an indirect path

When you jiggle a single input $x_i$, you do not only jiggle $x_i$; you also jiggle the batch mean $\mu $, and $\mu $ appears in every sample’s centered value $x_k - \mu $. So perturbing $x_i$ by a small $h$ shifts every centered value by $-h/n$, and the $i$th sample additionally by the direct $+h$.

\[ \frac{\partial (x_j - \mu )}{\partial x_i} \; =\; \delta _{ij} - \frac{1}{n}. \]

The $\delta _{ij}$ is the direct effect; the $-1/n$ is the indirect effect through $\mu $. Theorem 31 states this formally. Forgetting the $-1/n$ is the most common hand-derivation mistake on BN, and is exactly what the formal proof prevents.
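A quick numerical check of this Jacobian, as a sketch in plain Python (the helper names are illustrative, not from the book's code): perturb each input in turn and compare the measured slope of every centered value with $\delta _{ij} - 1/n$.

```python
def centered(x):
    mu = sum(x) / len(x)
    return [xk - mu for xk in x]

def centering_jacobian_fd(x, h=1e-4):
    """J[i][j] approximates d(x_j - mu)/d x_i by central differences."""
    n = len(x)
    J = []
    for i in range(n):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        cp, cm = centered(xp), centered(xm)
        J.append([(cp[j] - cm[j]) / (2 * h) for j in range(n)])
    return J

x = [0.3, -1.2, 2.5, 0.9]
n = len(x)
J = centering_jacobian_fd(x)

# Closed form: Kronecker delta minus 1/n.
expected = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
max_err = max(abs(J[i][j] - expected[i][j]) for i in range(n) for j in range(n))

# The hand-derivation mistake: using the identity and dropping the -1/n.
naive_err = max(abs(J[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))
print(max_err, naive_err)
```

Because centering is linear, the finite-difference estimate agrees with the closed form up to rounding, while the identity-only guess is off by exactly $1/n$ in every entry.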

The inverse-stddev term

The normalize step divides by $\sqrt{\sigma ^2 + \varepsilon }$, where $\sigma ^2 = \frac{1}{n} \sum _k (x_k - \mu )^2$ is itself a function of every input. Perturb $x_i$ by a small $h$ and $\sigma ^2$ changes, which means the divisor changes, which means every output $\hat{x}_j$ changes, not just $\hat{x}_i$.

Working through the chain rule ($x \mapsto (x - \mu )^2 \mapsto \text{mean} \mapsto \sqrt{\cdot + \varepsilon } \mapsto 1/\cdot $) gives

\[ \frac{\partial }{\partial x_i} \frac{1}{\sqrt{\sigma ^2 + \varepsilon }} \; =\; -\frac{1}{(\sigma ^2 + \varepsilon )^{3/2}} \cdot \frac{x_i - \mu }{n} \; =\; -\, \mathrm{istd}^3 \cdot \frac{x_i - \mu }{n}. \]

That $\mathrm{istd}^3$ is what makes the BN backward expensive: the gradient of one sample depends on every sample’s centered value, scaled by the cube of the inverse standard deviation. Theorem 33 states this.

Notice what happens to the formula as $\sigma ^2 \to 0$: $\mathrm{istd} \to 1/\sqrt{\varepsilon }$, bounded; without $\varepsilon $ the gradient diverges. Theorem 32 is the formal statement that $\varepsilon {\gt} 0$ is sufficient to make this term differentiable. This is the one place in the book where the math actually requires real-analysis machinery beyond chain-sum-product; everything else in the framework reduces to those three rules.
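The closed form for the inverse-stddev gradient and the role of $\varepsilon $ can both be checked numerically. The sketch below is plain Python with illustrative names (`istd`, `istd_grad`, `istd_grad_fd` are not the book's identifiers): it compares the formula with central differences and then evaluates a constant batch, where $\sigma ^2$ is exactly zero.

```python
import math

def istd(x, eps):
    n = len(x)
    mu = sum(x) / n
    var = sum((xk - mu) ** 2 for xk in x) / n
    return 1.0 / math.sqrt(var + eps)

def istd_grad(x, eps):
    """Closed form: d istd / d x_i = -istd^3 * (x_i - mu) / n."""
    n = len(x)
    mu = sum(x) / n
    s = istd(x, eps)
    return [-(s ** 3) * (xi - mu) / n for xi in x]

def istd_grad_fd(x, eps, h=1e-5):
    """Central-difference estimate of the same gradient."""
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((istd(xp, eps) - istd(xm, eps)) / (2 * h))
    return g

x = [0.3, -1.2, 2.5, 0.9]
eps = 1e-5
grad_err = max(abs(a - b) for a, b in zip(istd_grad(x, eps), istd_grad_fd(x, eps)))

# Constant batch: the variance is exactly 0, so istd sits at 1/sqrt(eps).
flat = [1.0, 1.0, 1.0, 1.0]
ceiling = istd(flat, eps)

# With eps = 0 the same batch has no finite inverse stddev at all.
try:
    istd(flat, 0.0)
    diverged = False
except ZeroDivisionError:
    diverged = True
print(grad_err, ceiling, diverged)
```

With $\varepsilon = 10^{-5}$ the ceiling is about $316$; setting $\varepsilon = 0$ on the same batch makes the division fail outright, which is the floating-point face of the divergence described above.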

The three-term backward

Compose the three forward steps and apply the product rule on $\hat{x} = (x - \mu ) \cdot \mathrm{istd}$. The cross-terms collapse (the centered sum $\sum _k (x_k - \mu ) = 0$ is what saves us) and what falls out is a one-line backward:

\[ dx \; =\; \frac{\mathrm{istd} \cdot \gamma }{n}\, \Bigl(\, n\, dy \; -\; \textstyle \sum _k dy_k \; -\; \hat{x} \cdot \textstyle \sum _k (dy_k \, \hat{x}_k)\, \Bigr). \]

Three terms inside the parentheses: one direct effect, plus one correction for each of the two indirect paths (through $\mu $ and through $\sigma ^2$):

$n\, dy$—the direct effect, every sample’s upstream gradient.
$-\sum _k dy_k$—the centering correction, subtracts the total upstream gradient because shifting the mean shifts every sample.
$-\hat{x} \cdot \sum _k (dy_k\, \hat{x}_k)$—the normalization correction, subtracts the projection of the upstream gradient onto $\hat{x}$, because rescaling by the inverse stddev couples every sample’s gradient through the shared divisor.

Theorem 34 formalizes this. Production BN implementations (PyTorch, JAX, TensorFlow, custom CUDA kernels) compute this expression or an algebraically equivalent one. The value of the formal proof is not that we discovered the formula (it has been known since 2015) but that the Lean kernel mechanically verifies we are computing the right thing at every training step.
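As a sanity check on the three-term formula, the following plain-Python sketch implements it for one channel and compares it with a central-difference gradient of the scalar $L(x) = \sum _j dy_j\, \mathrm{bn}(x)_j$. Function names and the sample values are illustrative assumptions, not the book's rendered code.

```python
import math

def bn_forward(x, gamma, beta, eps):
    n = len(x)
    mu = sum(x) / n
    var = sum((xk - mu) ** 2 for xk in x) / n
    istd = 1.0 / math.sqrt(var + eps)
    xhat = [(xk - mu) * istd for xk in x]
    return [gamma * v + beta for v in xhat], xhat, istd

def bn_backward(dy, xhat, istd, gamma):
    """dx = istd*gamma/n * (n*dy - sum(dy) - xhat * sum(dy*xhat))."""
    n = len(dy)
    sum_dy = sum(dy)                                     # centering correction
    sum_dy_xhat = sum(d * v for d, v in zip(dy, xhat))   # normalization correction
    scale = istd * gamma / n
    return [scale * (n * dy[i] - sum_dy - xhat[i] * sum_dy_xhat) for i in range(n)]

def bn_backward_fd(x, dy, gamma, beta, eps, h=1e-5):
    """Central-difference gradient of L(x) = sum_j dy_j * bn(x)_j."""
    def loss(z):
        y, _, _ = bn_forward(z, gamma, beta, eps)
        return sum(d * v for d, v in zip(dy, y))
    g = []
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g.append((loss(xp) - loss(xm)) / (2 * h))
    return g

x = [0.3, -1.2, 2.5, 0.9]
dy = [0.5, -1.0, 0.25, 2.0]
gamma, beta, eps = 1.5, 0.1, 1e-5

_, xhat, istd = bn_forward(x, gamma, beta, eps)
dx = bn_backward(dy, xhat, istd, gamma)
dx_fd = bn_backward_fd(x, dy, gamma, beta, eps)
max_diff = max(abs(a - b) for a, b in zip(dx, dx_fd))
print(max_diff, sum(dx))
```

One property worth noticing: the entries of `dx` sum to zero (up to rounding), because adding a constant to every sample leaves the BN output unchanged.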

The affine step is dense-with-broadcasting

The third step, $\gamma \hat{x} + \beta $, is structurally a dense layer applied per-channel. Its Jacobian $\partial (\gamma v_j + \beta )/\partial v_i = \gamma \delta _{ij}$ is exactly the dense-Jacobian computation from Chapter 2, just lifted to a tensor shape and broadcast across spatial dimensions. Theorem 30 and Theorem 35 are essentially corollaries of the dense theorems; the new content here is zero.

Putting it together

Theorem 36 is the composition: $\mathrm{bn} = \mathrm{affine} \circ \mathrm{normalize}$, with VJPs chained via the same $\mathrm{vjp\_ comp}$ rule from Chapter 1. The 3-term backward is the centerpiece; affine is a corollary; composition is a one-line proof. The structural story of the chapter is: one new Mathlib-level analytic dependency (sqrt and recip differentiability), one formula, and the rest is the same chain rule we already had.
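The composition can be made concrete with a small sketch: chain the affine backward into the normalize backward and compare with the one-line formula that carries $\gamma $ inside. This is plain Python with illustrative function names; the values of `xhat` and `istd` are hand-picked placeholders, since the identity is algebraic and holds for any of them.

```python
def affine_backward(dy, gamma):
    """Backward of y = gamma * xhat + beta with respect to xhat."""
    return [gamma * d for d in dy]

def normalize_backward(d_xhat, xhat, istd):
    """Three-term backward of the normalize step (no gamma)."""
    n = len(d_xhat)
    s1 = sum(d_xhat)
    s2 = sum(v * d for v, d in zip(xhat, d_xhat))
    return [istd / n * (n * d_xhat[i] - s1 - xhat[i] * s2) for i in range(n)]

def bn_backward_composed(dy, xhat, istd, gamma):
    # Chain rule: affine backward first, then normalize backward.
    return normalize_backward(affine_backward(dy, gamma), xhat, istd)

def bn_backward_one_line(dy, xhat, istd, gamma):
    n = len(dy)
    s1 = sum(dy)
    s2 = sum(d * v for d, v in zip(dy, xhat))
    return [istd * gamma / n * (n * dy[i] - s1 - xhat[i] * s2) for i in range(n)]

# Placeholder values for illustration only.
xhat = [-1.2, -0.4, 0.3, 1.3]
istd = 0.8
gamma = 1.5
dy = [0.5, -1.0, 0.25, 2.0]

composed = bn_backward_composed(dy, xhat, istd, gamma)
one_line = bn_backward_one_line(dy, xhat, istd, gamma)
max_gap = max(abs(a - b) for a, b in zip(composed, one_line))
print(max_gap)
```

The two agree to rounding because the normalize backward is linear in its upstream gradient, so the factor $\gamma $ can be applied before or after it.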

4.1 The theorems

Theorem 30 BN affine step Jacobian

✓

For $\mathrm{bnAffine}(\gamma , \beta )\, v = \lambda i.\; \gamma v_i + \beta $: prove: $\operatorname {pdiv}\bigl(\mathrm{bnAffine}(\gamma , \beta )\bigr)\, v\, i\, j = \gamma \, \delta _{ij}$.

Proof ▶

1. $\operatorname {pdiv}\bigl(\mathrm{bnAffine}(\gamma , \beta )\bigr)\, v\, i\, j = \operatorname {pdiv}(y \mapsto \gamma \cdot y)\, v\, i\, j$.
proof: Split $\gamma v_i + \beta $ as $(\gamma \cdot v) + (\text{const } \beta )$: sum rule (Theorem 3) and constant rule (Theorem 6).
2. $\operatorname {pdiv}(y \mapsto \gamma \cdot y)\, v\, i\, j = \gamma \, \delta _{ij}$.
proof: Factor as $(\text{const } \gamma ) \cdot (\text{identity})$: product rule (Theorem 4); the constant factor’s Jacobian vanishes (Theorem 6) and the identity Jacobian is $\delta _{ij}$ (Theorem 5).
q.e.d.
proof: Chain 1 and 2.

Theorem 31 BN centering Jacobian

✓

For $\mathrm{bnCentered}\, x = \lambda j.\; x_j - \mu (x)$, where $\mu (x) = \frac{1}{n}\sum _s x_s$: prove: $\operatorname {pdiv}(\mathrm{bnCentered})\, x\, i\, j = \delta _{ij} - 1/n$.

Proof ▶

1. $\operatorname {pdiv}(\mathrm{bnCentered})\, x\, i\, j = \delta _{ij} + \operatorname {pdiv}\bigl(y \mapsto -\tfrac {1}{n}\textstyle \sum _s y_s\bigr)\, x\, i\, j$.
proof: $\mathrm{bnCentered} = \mathrm{id} + \bigl(y \mapsto -(\sum _s y_s)/n\bigr)$: sum rule (Theorem 3), identity Jacobian (Theorem 5).
2. $\operatorname {pdiv}\bigl(y \mapsto -\tfrac {1}{n}\textstyle \sum _s y_s\bigr)\, x\, i\, j = -\tfrac {1}{n} \cdot \operatorname {pdiv}\bigl(y \mapsto \textstyle \sum _s y_s\bigr)\, x\, i\, j$.
proof: Factor as $(\text{const } -\tfrac {1}{n}) \cdot (\text{sum})$: product rule (Theorem 4); the constant factor’s Jacobian vanishes (Theorem 6).
3. $\operatorname {pdiv}\bigl(y \mapsto \textstyle \sum _s y_s\bigr)\, x\, i\, j = \sum _s \delta _{is} = 1$.
proof: Finite-sum rule (Theorem 8) over the coordinate projections; each projection is a reindex with Jacobian $\delta _{is}$ (Theorem 7); the Kronecker sum collapses (Finset.sum_ite_eq).
q.e.d.
proof: Chain 1–3: $\delta _{ij} + (-\tfrac {1}{n}) \cdot 1 = \delta _{ij} - 1/n$.

Theorem 32 BN inverse-stddev broadcast smoothness

✓

assume:

$\varepsilon {\gt} 0$ [h$\varepsilon $]

prove: $\operatorname {bnIstdBroadcast}= x \mapsto 1/\sqrt{\sigma ^2(x) + \varepsilon }$ is $\mathsf{Differentiable}$. This is the sqrt/recip smoothness that the product rule needs inside the normalize Jacobian.

Proof ▶

1. $\sigma ^2(x) + \varepsilon {\gt} 0$ for every $x$, in particular $\neq 0$.
proof: $\sigma ^2(x) \ge 0$ (a sum of squares over $n$); assumption 1 pushes it strictly positive.
2. $x \mapsto \sigma ^2(x) + \varepsilon $ is differentiable.
proof: Polynomial in the coordinates of $x$ (fun_prop).
3. $x \mapsto \sqrt{\sigma ^2(x) + \varepsilon }$ is differentiable, and nowhere zero.
proof: Differentiable.sqrt applies away from $0$, which step 1 grants; $\sqrt{\cdot }$ of a positive is positive.
q.e.d.
proof: $\mathrm{istd} = (\sqrt{\sigma ^2 + \varepsilon })^{-1}$; Differentiable.inv with the nonvanishing denominator from step 3.

Theorem 33 BN inverse-stddev broadcast Jacobian

✓

assume:

$\varepsilon {\gt} 0$ [h$\varepsilon $]

prove: writing $s := \mathrm{istd}(x, \varepsilon ) = 1/\sqrt{\sigma ^2(x) + \varepsilon }$:

\[ \operatorname {pdiv}(\operatorname {bnIstdBroadcast})\, x\, i\, j = -s^3 \cdot (x_i - \mu )/n. \]

Proof ▶

Sketch: the one genuinely analytic Jacobian of the chapter — a Fréchet-derivative chain through variance, $\sqrt{\cdot }$, and $({\cdot })^{-1}$, closed by the $\sum _k (x_k - \mu ) = 0$ identity.

1. $\sigma ^2(x) + \varepsilon {\gt} 0$, so $\sqrt{\sigma ^2(x) + \varepsilon } {\gt} 0$.
proof: Sum of squares $\ge 0$ plus assumption 1.
2. $\operatorname {pdiv}(\operatorname {bnIstdBroadcast})\, x\, i\, j = \operatorname {fderiv}_{\mathbb {R}}\, \bigl(x' \mapsto \mathrm{istd}(x', \varepsilon )\bigr)\, x\, (\mathbf{e}_i)$; the output is constant in $j$, so $\operatorname {pdiv}$ reduces to the scalar derivative.
proof: Definition 1 and fderiv_apply, legitimate by Theorem 32.
3. Derivative chain. Let $C_k := \mathrm{proj}_k - \tfrac {1}{n}\sum _{i'} \mathrm{proj}_{i'}$ be the centering CLM (so $C_k\, y = y_k - \mu (y)$). Then $\sigma ^2 = \tfrac {1}{n}\sum _k C_k^2$ has Fréchet derivative $\tfrac {1}{n}\sum _k 2\, C_k(x) \cdot C_k$, and composing through $\sqrt{\cdot }$ (HasFDerivAt.sqrt, licensed by step 1) and $({\cdot })^{-1}$ (hasDerivAt_inv) gives $\partial \, \mathrm{istd} / \partial \sigma ^2 = -\tfrac {1}{2}\, s^3$.
proof: Product rule per square, summed; the two Mathlib compositions.
4. Evaluate at $\mathbf{e}_i$: $\partial \sigma ^2 / \partial x_i = \tfrac {2}{n}(x_i - \mu )$.
proof: $C_k(\mathbf{e}_i) = \delta _{ki} - \tfrac {1}{n}$, so the sum in step 3 is $\tfrac {2}{n} \sum _k (x_k - \mu )(\delta _{ki} - \tfrac {1}{n})$, which collapses to $\tfrac {2}{n}(x_i - \mu )$ because $\sum _k (x_k - \mu ) = 0$.
q.e.d.
proof: Chain 3 and 4: $-\tfrac {1}{2}\, s^3 \cdot \tfrac {2}{n}(x_i - \mu ) = -s^3 (x_i - \mu )/n$; with step 2 this is the goal.

Theorem 34 BN normalize 3-term VJP

✓

assume:

$\varepsilon {\gt} 0$ [h$\varepsilon $]

prove: $\mathsf{HasVJP}\, (\mathrm{bnNormalize})$, with the consolidated three-term backward (writing $s := \mathrm{istd}$, $\hat{x} := \mathrm{bnXhat}$):

\[ B(x, d\hat{x})_i = \tfrac {1}{n}\, s \Bigl( n\, d\hat{x}_i - \sum _j d\hat{x}_j - \hat{x}_i \sum _j \hat{x}_j\, d\hat{x}_j \Bigr). \]

Proof ▶

Sketch: product rule on $\hat{x} = (x - \mu ) \cdot s$ merges the two elementary Jacobians into one formula; contracting with $d\hat{x}$ splits it into the three terms.

1. Consolidated Jacobian: $\operatorname {pdiv}(\mathrm{bnNormalize})\, x\, i\, j = \tfrac {s}{n}\bigl(n\, \delta _{ij} - 1 - \hat{x}_i \hat{x}_j\bigr)$.
proof: Factor $\hat{x}$ as the elementwise product $\mathrm{bnCentered} \cdot \operatorname {bnIstdBroadcast}$ (bnXhat_eq_product); product rule (Theorem 4), differentiable because $\mathrm{bnCentered}$ is affine and $\operatorname {bnIstdBroadcast}$ is smooth (Theorem 32, assumption 1); substitute the centering Jacobian (Theorem 31) and the istd Jacobian (Theorem 33); the $\hat{x} = (x - \mu ) \cdot s$ identity plus field_simp/ring collapse the algebra ($n \neq 0$ since $\mathrm{Fin}\, n$ is inhabited by $i$). This step is the Lean lemma pdiv_bnNormalize.
2. suffices: for all $x$, $d\hat{x}$, $i$: $B(x, d\hat{x})_i = \sum _j \operatorname {pdiv}(\mathrm{bnNormalize})\, x\, i\, j \cdot d\hat{x}_j$.
proof: Definition 9, with $B$ as the candidate backward function.
q.e.d.
proof: Substitute 1 into 2 and split the sum into its three pieces: the $\delta $-term collapses to $n\, d\hat{x}_i$ (Finset.sum_ite_eq), the $-1$ term gives $-\sum _j d\hat{x}_j$, and factoring $\hat{x}_i$ out of the third gives $-\hat{x}_i \sum _j \hat{x}_j d\hat{x}_j$; scale by $s/n$ and this is $B$.

Theorem 35 BN affine VJP

✓

$\mathsf{HasVJP}\, \bigl(\mathrm{bnAffine}(\gamma , \beta )\bigr)$: each input feeds one output scaled by $\gamma $, so the gradient comes back scaled by $\gamma $.

Proof ▶

Define $B(v, dy)_i := \gamma \cdot dy_i$.
suffices: for all $v$, $dy$, $i$: $B(v, dy)_i = \sum _j \operatorname {pdiv}\bigl(\mathrm{bnAffine}(\gamma , \beta )\bigr)\, v\, i\, j \cdot dy_j$.
proof: Definition 9, with $B$ as the candidate backward function.
q.e.d.
proof: By the affine Jacobian (Theorem 30) the sum is $\sum _j \gamma \, \delta _{ij}\, dy_j = \gamma \, dy_i$.

Theorem 36 Full BN VJP

✓

assume:

$\varepsilon {\gt} 0$ [h$\varepsilon $]

prove: $\mathsf{HasVJP}\, \bigl(\mathrm{bnForward}(\varepsilon , \gamma , \beta )\bigr)$.

Proof ▶

1. $\mathrm{bnForward} = \mathrm{bnAffine} \circ \mathrm{bnNormalize}$.
proof: Definitional (bnForward_eq_compose).
2. $\mathrm{bnNormalize}$ is differentiable everywhere.
proof: It is the elementwise product of $\mathrm{bnCentered}$ (affine, hence smooth) and $\operatorname {bnIstdBroadcast}$ (smooth by Theorem 32, assumption 1).
3. $\mathrm{bnAffine}$ is differentiable everywhere.
proof: Affine (fun_prop).
q.e.d.
proof: VJP chain rule (Theorem 10) with 2, 3 and the two halves (Theorems 34, 35). The composed backward is exactly the two-step MLIR backward: $d\hat{x} = \gamma \cdot dy$, then the three-term formula.

4.2 Example: training dynamics on CIFAR

Chapter 3 did MNIST with two convolutions and a 2$\times $512 dense head. This chapter keeps that exact head and the same conv/ReLU/max-pool machinery, but — CIFAR-10 being a genuinely harder problem — stacks the convolutions four stages deep: eight $3 \times 3$ convolutions in all. Same head, four times the body. That extra depth is the whole point of the chapter: it is where BatchNorm starts to pay, where the optimizer starts to matter, and the same depth ResNet-34 will scale to thirty-four layers. It also lets us ask a sharper question than “does it train?” — what governs how it trains?

Two levers govern the answer — normalization and the optimizer — and a third knob, the width of the head, turns out not to. We move each against an identical, machine-checked gradient, holding everything else fixed (same architecture, same data pipeline — per-epoch shuffle, random horizontal flip, cosine learning-rate schedule with warmup — same 40 epochs). The findings, stated up front so the graphs are no surprise:

Normalization buys speed. BatchNorm (inserted between each conv and its ReLU) reaches a given accuracy in markedly fewer epochs; at convergence it lands level with or slightly above the un-normalized net. It is an accelerator first of all.
The optimizer moves the ceiling most. Trading plain SGD for SGD-with-momentum is worth three to four points of final accuracy, more than anything else here. AdamW is roughly a wash with plain SGD; its per-coordinate adaptivity earns little it can keep.
The head barely matters. This 2$\times $512 head carries about 25$\times $ the parameters of a narrow 64-wide one, yet trains to within a point of it — the convolutional body, not the head, is doing the work. That is also why we can borrow MNIST’s head wholesale: it was never the bottleneck.

Every run below shares one proof-rendered backward pass; only the BN layers or the rendered optimizer tail change. The numbers are from that verified training step running on a GPU.
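The shared pipeline's cosine learning-rate schedule with warmup can be sketched as follows. The base rate 0.1 and the 40 epochs are the plain-SGD settings quoted in this section; the linear shape of the warmup, its two-epoch length, and the per-epoch (not per-step) granularity are assumptions made purely for illustration, since the text does not specify them.

```python
import math

def lr_at(epoch, total_epochs=40, base_lr=0.1, warmup_epochs=2):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup schedule already completed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(e) for e in range(40)]
print(schedule[:3], schedule[-1])
```

Under these assumptions the rate ramps to 0.1 by the second epoch and then decays smoothly toward zero by epoch 40.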

Architecture

The same vertical column as the MNIST CNN in Chapter 3, just deeper: eight $3 \times 3$ convolutions in four stages (a max-pool after each) take the $32 \times 32 \times 3$ input to a $2 \times 2 \times 32$ feature grid, which flattens to 128 and runs through the same 2$\times $512 dense head as that net. Only the input width of the first dense layer changes ($6272$ in the MNIST net vs. $128$ here), because the deeper stack pools the grid down further. Eight convolutions is four times the MNIST net’s two, and that growth in depth, not width, is the lineage ResNet-34 (Ch 5) carries to thirty-four layers.
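The shape bookkeeping in this paragraph can be traced mechanically. The sketch below is plain Python, not the book's NetSpec machinery: it assumes 'same' padding for every $3 \times 3$ convolution and a $2 \times 2$ max-pool with stride 2, and the tuple encoding of layers is an illustrative convention.

```python
def trace_shapes(h, w, c, layers):
    """Return the (H, W, C) shape after the input and after every layer."""
    shapes = [(h, w, c)]
    for kind, arg in layers:
        if kind == "conv":       # 'same' conv keeps H and W, sets the channel count
            c = arg
        elif kind == "pool":     # max-pool with stride `arg` shrinks H and W
            h, w = h // arg, w // arg
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, c))
    return shapes

# Four stages: two convolutions at the stage width, then one 2x2 max-pool.
stage_widths = [16, 16, 32, 32]
layers = []
for width in stage_widths:
    layers += [("conv", width), ("conv", width), ("pool", 2)]

shapes = trace_shapes(32, 32, 3, layers)
final_h, final_w, final_c = shapes[-1]
flat = final_h * final_w * final_c
print(shapes[-1], flat)
```

Each pool halves the grid, so four stages take $32 \to 16 \to 8 \to 4 \to 2$, and the flattened width is $2 \cdot 2 \cdot 32 = 128$.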

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.58cm, minimum width=4.7cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, convbn/.style = {col, draw=orange!65!black, fill=orange!12}, pool/.style = {col, draw=teal!60!black, fill=teal!10}, flat/.style = {col, draw=purple!60!black, fill=purple!8}, dense/.style = {col, draw=orange!65!black, fill=orange!12}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, stage/.style = {font=\sffamily\scriptsize\itshape, gray!55!black, anchor=west}, ] % Vertical layer column: input at top -> logits at bottom. \node[io] (input) {Input \;\; $32\times32\times3$}; \node[convbn, below=0.18cm of input] (c1) {\textbf{ConvBN} $3\to16$, $3\times3$, ReLU}; \node[convbn, below=0.18cm of c1] (c2) {\textbf{ConvBN} $16\to16$, $3\times3$, ReLU}; \node[pool, below=0.18cm of c2] (p1) {\textbf{maxPool} $2\times2$ \;\; $32\to16$}; \node[convbn, below=0.18cm of p1] (c3) {\textbf{ConvBN} $16\to16$, $3\times3$, ReLU}; \node[convbn, below=0.18cm of c3] (c4) {\textbf{ConvBN} $16\to16$, $3\times3$, ReLU}; \node[pool, below=0.18cm of c4] (p2) {\textbf{maxPool} $2\times2$ \;\; $16\to8$}; \node[convbn, below=0.18cm of p2] (c5) {\textbf{ConvBN} $16\to32$, $3\times3$, ReLU}; \node[convbn, below=0.18cm of c5] (c6) {\textbf{ConvBN} $32\to32$, $3\times3$, ReLU}; \node[pool, below=0.18cm of c6] (p3) {\textbf{maxPool} $2\times2$ \;\; $8\to4$}; \node[convbn, below=0.18cm of p3] (c7) {\textbf{ConvBN} $32\to32$, $3\times3$, ReLU}; \node[convbn, below=0.18cm of c7] (c8) {\textbf{ConvBN} $32\to32$, $3\times3$, ReLU}; \node[pool, below=0.18cm of c8] (p4) {\textbf{maxPool} $2\times2$ \;\; $4\to2$}; \node[flat, below=0.18cm of p4] (fl) {flatten \;\; $2\times2\times32 \to 128$}; \node[dense, below=0.18cm of 
fl] (d1) {\textbf{Dense} $128\to512$, ReLU}; \node[dense, below=0.18cm of d1] (d2) {\textbf{Dense} $512\to512$, ReLU}; \node[head, below=0.18cm of d2] (d3) {\textbf{Dense} $512\to10$ \;(identity)}; \node[logits, below=0.18cm of d3] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/c1, c1/c2, c2/p1, p1/c3, c3/c4, c4/p2, p2/c5, c5/c6, c6/p3, p3/c7, c7/c8, c8/p4, p4/fl, fl/d1, d1/d2, d2/d3, d3/out} \draw[arr] (\a) -- (\b); % Stage brackets on the right. \node[stage] at ($(c1.east)!0.5!(p1.east) + (0.35,0)$) {stage 1}; \node[stage] at ($(c3.east)!0.5!(p2.east) + (0.35,0)$) {stage 2}; \node[stage] at ($(c5.east)!0.5!(p3.east) + (0.35,0)$) {stage 3}; \node[stage] at ($(c7.east)!0.5!(p4.east) + (0.35,0)$) {stage 4}; % Bridge cue: the dense head is exactly the MNIST CNN's. \node[stage, align=left] at ($(d1.east)!0.5!(d2.east) + (0.35,0)$) {= MNIST CNN\\head (Ch~\ref{chap:cnn})}; \end{tikzpicture}$

The ConvBN boxes are the with-BN variant; the no-BN net is identical with each ConvBN replaced by a plain Conv2D + ReLU. Both specs are written out next.

The two specs, differing by one keyword per layer

Eight $3 \times 3$ convolutions in four stages (channel widths $16, 16, 32, 32$, a max-pool after each stage), then the 2$\times $512 dense head lifted straight from the MNIST CNN. Without BN, each convolution is followed by a plain ReLU; with BN, a per-channel BatchNorm sits between the convolution and the ReLU. That one keyword per conv layer is the entire difference.

Without BN:

def cifar8NoBn : NetSpec where
  name := "CIFAR-CNN8-noBN"
  imageH := 32
  imageW := 32
  layers := [
    .conv2d 3  16 3 .same .relu,  .conv2d 16 16 3 .same .relu, .maxPool 2 2,
    .conv2d 16 16 3 .same .relu,  .conv2d 16 16 3 .same .relu, .maxPool 2 2,
    .conv2d 16 32 3 .same .relu,  .conv2d 32 32 3 .same .relu, .maxPool 2 2,
    .conv2d 32 32 3 .same .relu,  .conv2d 32 32 3 .same .relu, .maxPool 2 2,
    .flatten,
    .dense 128 512 .relu, .dense 512 512 .relu, .dense 512 10 .identity
  ]

With BN:

def cifar8Bn : NetSpec where
  name := "CIFAR-CNN8-BN"
  imageH := 32
  imageW := 32
  layers := [
    .convBn 3  16 3 1 .same,  .convBn 16 16 3 1 .same, .maxPool 2 2,
    .convBn 16 16 3 1 .same,  .convBn 16 16 3 1 .same, .maxPool 2 2,
    .convBn 16 32 3 1 .same,  .convBn 32 32 3 1 .same, .maxPool 2 2,
    .convBn 32 32 3 1 .same,  .convBn 32 32 3 1 .same, .maxPool 2 2,
    .flatten,
    .dense 128 512 .relu, .dense 512 512 .relu, .dense 512 10 .identity
  ]

The diff: eight .conv2d $\cdots $ .relu layers become eight .convBn $\cdots $ layers. The dense head, the max-pools, and the training config are all identical. Both nets are verified end to end — forward, gradient, and the rendered StableHLO training step are machine-checked (Proofs/CifarCNN.lean, Proofs/Cifar8Close.lean) — and both run through the same IREE FFI path on the GPU.

Lever 1: normalization

Fix the optimizer at plain SGD and toggle BN. Both nets train for 40 epochs on the shared pipeline; per-epoch test accuracy, from the verified runs (runs/ablation_cifar8/{nobn,bn}_sgdsched.log):

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Test accuracy (\%)}, xmin=0, xmax=41, ymin=28, ymax=78, xtick={0,5,10,15,20,25,30,35,40}, ytick={30,40,50,60,70}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,36.51) (2,44.34) (3,50.14) (4,54.30) (5,60.84) (6,63.31) (7,62.39) (8,66.26) (9,66.06) (10,67.92) (11,69.88) (12,70.40) (13,71.39) (14,68.93) (15,71.61) (16,72.27) (17,72.36) (18,71.83) (19,73.11) (20,73.41) (21,73.02) (22,73.87) (23,73.72) (24,72.99) (25,73.69) (26,73.23) (27,73.66) (28,73.87) (29,74.08) (30,73.91) (31,73.85) (32,74.14) (33,74.22) (34,74.20) (35,74.24) (36,74.12) (37,74.28) (38,74.24) (39,74.15) (40,74.23) }; \addlegendentry{with BN} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (1,33.90) (2,28.36) (3,34.98) (4,47.56) (5,51.55) (6,51.41) (7,57.98) (8,58.01) (9,58.40) (10,62.19) (11,62.64) (12,64.23) (13,65.03) (14,64.50) (15,64.94) (16,67.52) (17,68.46) (18,68.73) (19,68.52) (20,70.13) (21,70.42) (22,69.69) (23,70.82) (24,70.15) (25,70.32) (26,70.59) (27,71.44) (28,71.06) (29,71.97) (30,71.99) (31,71.61) (32,72.16) (33,72.26) (34,71.96) (35,72.34) (36,72.24) (37,72.40) (38,72.42) (39,72.39) (40,72.39) }; \addlegendentry{no BN} \end{axis} \end{tikzpicture}$

CIFAR-10, wide 8-conv net, plain SGD (lr 0.1, cosine${+}$warmup) on the shared pipeline, 40 epochs, BN vs no-BN, through the verified renderer (runs/ablation_cifar8w/{nobn,bn}.log, the SGD phase).

BN leads from the first epoch (36.5% to the bare net’s 33.9%), opens a wide gap at the second (44% to 28%), and keeps the lead the whole way, finishing about two points up (74% vs 72%); the un-normalized net traces the same arc roughly a dozen epochs behind. Most of that is speed (the curves are the same shape, shifted left), but at this depth on plain SGD it is also a small, real convergence margin: the conditioning the three-term backward (Theorem 34) buys is worth more the deeper the stack, exactly the trend that makes BN standard equipment by ResNet’s thirty-four layers. That margin is largest precisely for plain SGD; under momentum and AdamW (next) both nets already train well enough that BN’s lead all but vanishes.
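One way to put a number on "speed" is to ask at which epoch each curve first reaches a given accuracy. The sketch below uses the per-epoch test accuracies plotted above for the plain-SGD runs; the helper `first_epoch_at` and the choice of thresholds are illustrative assumptions.

```python
# Per-epoch test accuracy (%), plain SGD, copied from the plot data above.
bn = [36.51, 44.34, 50.14, 54.30, 60.84, 63.31, 62.39, 66.26, 66.06, 67.92,
      69.88, 70.40, 71.39, 68.93, 71.61, 72.27, 72.36, 71.83, 73.11, 73.41,
      73.02, 73.87, 73.72, 72.99, 73.69, 73.23, 73.66, 73.87, 74.08, 73.91,
      73.85, 74.14, 74.22, 74.20, 74.24, 74.12, 74.28, 74.24, 74.15, 74.23]
no_bn = [33.90, 28.36, 34.98, 47.56, 51.55, 51.41, 57.98, 58.01, 58.40, 62.19,
         62.64, 64.23, 65.03, 64.50, 64.94, 67.52, 68.46, 68.73, 68.52, 70.13,
         70.42, 69.69, 70.82, 70.15, 70.32, 70.59, 71.44, 71.06, 71.97, 71.99,
         71.61, 72.16, 72.26, 71.96, 72.34, 72.24, 72.40, 72.42, 72.39, 72.39]

def first_epoch_at(curve, threshold):
    """1-indexed epoch at which the curve first reaches the threshold, or None."""
    for epoch, acc in enumerate(curve, start=1):
        if acc >= threshold:
            return epoch
    return None

lags = {}
for threshold in (60.0, 65.0, 70.0, 72.0):
    e_bn = first_epoch_at(bn, threshold)
    e_plain = first_epoch_at(no_bn, threshold)
    lags[threshold] = (e_bn, e_plain)
    print(threshold, e_bn, e_plain, e_plain - e_bn)
```

On these curves the un-normalized net trails by 5 epochs at 60% and 65%, by 8 at 70%, and by 16 at 72%, so the lag grows as both nets approach convergence.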

Lever 2: the optimizer

Now hold the architecture fixed and change only the update rule. Three optimizers, each at its own tuned learning rate, all on the identical pipeline and the same 40 epochs (final % / best %):

Net	SGD (lr 0.1)	momentum ($\mu $ 0.9, lr 0.02)	AdamW (lr $10^{-3}$)
no BN	72.4 / 72.4	76.7 / 76.9	73.6 / 73.7
BN	74.2 / 74.3	77.1 / 77.1	73.3 / 73.4

Momentum wins outright: three to four points over plain SGD in both nets, and the best result on the board (BN${+}$momentum, 77%). Plain SGD and AdamW are a rough tie: AdamW’s per-coordinate second-moment scaling earns nothing it can keep here, even with the wide head’s extra parameters to exploit. And reading down the columns recovers Lever 1: BN helps most under plain SGD ($\approx {+}1.8$), and its margin shrinks to a few tenths of a point either way under momentum and AdamW, which already condition the gradient well. Normalization, in other words, mostly compensates for a weak optimizer; give it a good one and the depth alone carries the net.
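The two readings of the table (down a column for BN's margin, across a row for the optimizer's gain) are simple differences of the final accuracies. A small sketch, with the dictionary layout chosen here only for illustration:

```python
# Final test accuracy (%) from the table above.
final_acc = {
    "SGD":      {"no BN": 72.4, "BN": 74.2},
    "momentum": {"no BN": 76.7, "BN": 77.1},
    "AdamW":    {"no BN": 73.6, "BN": 73.3},
}

# Reading down each column: what BN adds under a fixed optimizer.
bn_margin = {opt: round(row["BN"] - row["no BN"], 1) for opt, row in final_acc.items()}

# Reading across each row: what momentum adds over plain SGD for a fixed net.
momentum_gain = {
    net: round(final_acc["momentum"][net] - final_acc["SGD"][net], 1)
    for net in ("no BN", "BN")
}
print(bn_margin, momentum_gain)
```

BN's margin is $+1.8$ under SGD, $+0.4$ under momentum and $-0.3$ under AdamW, while momentum's gain over SGD is $+4.3$ without BN and $+2.9$ with it.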

The Lever-1 graph drew this for the SGD column — fixed optimizer, BN toggled. The same per-epoch picture for the other two columns, on the same axes, is what reading down the table looks like drawn out:

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, title={Nesterov momentum ($\mu$ 0.9, lr 0.02)}, title style={font=\small}, xlabel={Epoch}, ylabel={Test accuracy (\%)}, xmin=0, xmax=41, ymin=28, ymax=78, xtick={0,5,10,15,20,25,30,35,40}, ytick={30,40,50,60,70}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,43.86) (2,55.08) (3,60.85) (4,64.31) (5,67.17) (6,70.11) (7,71.45) (8,72.60) (9,72.77) (10,73.21) (11,74.21) (12,73.47) (13,74.56) (14,74.60) (15,75.14) (16,75.62) (17,75.57) (18,75.62) (19,76.66) (20,76.63) (21,76.23) (22,76.19) (23,75.88) (24,76.01) (25,76.41) (26,75.99) (27,76.85) (28,76.68) (29,76.70) (30,76.63) (31,76.65) (32,76.79) (33,76.80) (34,76.99) (35,77.04) (36,76.92) (37,76.99) (38,76.93) (39,77.01) (40,77.08) }; \addlegendentry{with BN} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (1,39.65) (2,49.78) (3,54.17) (4,59.18) (5,64.09) (6,65.48) (7,67.74) (8,68.40) (9,67.12) (10,69.62) (11,70.27) (12,72.04) (13,71.82) (14,71.47) (15,71.36) (16,72.83) (17,73.02) (18,73.77) (19,73.16) (20,74.77) (21,75.10) (22,74.88) (23,75.34) (24,75.44) (25,76.24) (26,75.53) (27,76.10) (28,76.68) (29,76.68) (30,76.76) (31,76.85) (32,76.60) (33,76.69) (34,76.60) (35,76.78) (36,76.61) (37,76.55) (38,76.63) (39,76.77) (40,76.70) }; \addlegendentry{no BN} \end{axis} \end{tikzpicture}$

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, title={AdamW (lr $10^{-3}$)}, title style={font=\small}, xlabel={Epoch}, ylabel={Test accuracy (\%)}, xmin=0, xmax=41, ymin=28, ymax=78, xtick={0,5,10,15,20,25,30,35,40}, ytick={30,40,50,60,70}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,33.63) (2,44.98) (3,51.91) (4,57.84) (5,60.84) (6,63.50) (7,65.32) (8,65.34) (9,67.08) (10,68.37) (11,68.88) (12,69.65) (13,70.75) (14,70.49) (15,72.12) (16,72.41) (17,71.72) (18,72.01) (19,71.86) (20,72.54) (21,72.78) (22,73.29) (23,72.56) (24,73.19) (25,72.32) (26,73.04) (27,73.30) (28,73.20) (29,73.15) (30,73.21) (31,73.36) (32,73.30) (33,73.42) (34,73.37) (35,73.38) (36,73.29) (37,73.40) (38,73.41) (39,73.36) (40,73.35) }; \addlegendentry{with BN} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (1,40.87) (2,47.06) (3,53.37) (4,58.26) (5,59.70) (6,61.16) (7,63.41) (8,65.70) (9,65.20) (10,68.51) (11,68.28) (12,67.93) (13,69.26) (14,70.59) (15,70.54) (16,71.65) (17,72.09) (18,71.44) (19,71.59) (20,73.10) (21,73.06) (22,73.27) (23,73.46) (24,73.00) (25,72.68) (26,73.11) (27,72.95) (28,73.45) (29,73.58) (30,73.21) (31,73.44) (32,73.51) (33,73.58) (34,73.70) (35,73.45) (36,73.66) (37,73.52) (38,73.52) (39,73.57) (40,73.57) }; \addlegendentry{no BN} \end{axis} \end{tikzpicture}$

Same wide 8-conv net, same shared pipeline, same 40 epochs; only the optimizer changes from the SGD panel above (runs/ablation_cifar8w/{nobn,bn}.log, the momentum and AdamW phases). Under momentum BN still takes the early lead (44% to 40% at epoch 1, and a visible margin through the first dozen epochs), but the un-normalized net pulls level around epoch 28 and both finish at about 77%, BN ahead by only about four tenths of a point. Under AdamW even that early lead is gone: the no-BN net starts ahead (41% to 34% at epoch 1), the two curves are tangled within a few epochs, and the no-BN net ends a hair in front ($73.6$ vs $73.3$), because AdamW’s per-coordinate second-moment scaling already conditions the step that BN would otherwise smooth. Read top-to-bottom (SGD, then momentum, then AdamW), the blue–orange gap collapses panel by panel: the same “BN helps most under a weak optimizer” story the table tells in two rows of numbers, now drawn epoch by epoch.

The reason we can make this comparison and trust it is that the optimizer is one swappable rendered tail. The forward pass, the softmax–cross-entropy loss, the backward pass, and every parameter gradient are the same proof-rendered graph in each column; only the final per-parameter update op changes:

SGD: $\theta \leftarrow \theta - \mathrm{lr}\cdot g$.
Momentum (Nesterov): $v \leftarrow \mu v + g$, then $\theta \leftarrow \theta - \mathrm{lr}\, (\mu v + g)$.
AdamW: the bias-corrected first/second-moment step, rendered op-for-op as Proofs.adamWParam.

Each tail is emitted onto the same certified gradient (emitSgd, emitMomentum, ViTRender.emitAdamV), so the ablation is honest in the strong sense: identical, machine-checked gradients with different arithmetic stacked on top — and for plain SGD, that the binary32 step actually decreases the loss is itself a proved theorem.
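The three update rules can be written side by side as scalar functions. This is a plain-Python sketch, not the rendered StableHLO: the SGD and Nesterov forms follow the formulas listed above, while the AdamW form is the standard decoupled-weight-decay update, and its hyperparameters (`beta1`, `beta2`, `eps`, `weight_decay`) are assumed defaults that may differ from what Proofs.adamWParam encodes.

```python
import math

def sgd_step(theta, g, lr):
    return theta - lr * g

def nesterov_step(theta, v, g, lr, mu):
    v = mu * v + g                       # velocity update
    theta = theta - lr * (mu * v + g)    # look-ahead step uses the new velocity
    return theta, v

def adamw_step(theta, m, v, g, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0):
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# One step from theta = 1.0 on the same gradient, at the learning rates in the table.
g = 2.0
theta_sgd = sgd_step(1.0, g, lr=0.1)
theta_mom, v_mom = nesterov_step(1.0, 0.0, g, lr=0.02, mu=0.9)
theta_adam, m_adam, v_adam = adamw_step(1.0, 0.0, 0.0, g, t=1, lr=1e-3)
print(theta_sgd, theta_mom, theta_adam)
```

All three consume the same gradient `g`; only the arithmetic applied to it differs, which is the sense in which the optimizer is a swappable tail. On the first step AdamW moves by almost exactly `lr` regardless of the gradient's magnitude, because the bias-corrected ratio is close to the sign of `g`.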

One honest note on the comparison. The first time we ran this, the optimizers looked far more separated — momentum and Adam beat SGD by some ten points. That gap was mostly an artifact: the momentum and Adam runs happened to go through a data pipeline with shuffling and flip augmentation that the plain-SGD baseline lacked. Holding the pipeline genuinely fixed — the table above — collapsed the difference to the few points that are really about the optimizer. It is exactly the kind of mistake a verified gradient does not catch: the math was correct in every run; the experiment design was what needed fixing. An ablation measures the thing you varied only if everything else is held constant — and “everything else” includes the parts that live outside the network.

Does the head width matter?

Almost not at all — which is exactly why we could borrow MNIST’s head untouched. The 2$\times $512 head carries about $334{,}000$ of the net’s $374{,}000$ parameters; swap it for a narrow 64-wide head ($13{,}000$ params, whole net down to $53{,}000$) and the same six runs land within a point either way — best-of-board $77.1\% $ wide vs $77.2\% $ narrow, with the optimizer ordering unchanged. Seven times the parameters buys no accuracy, and only about $1.4\times $ the wall-clock per epoch (${\approx }14$ vs ${\approx }10$ seconds on a gfx1100): the eight convolutions over the $32 \times 32$ maps dominate the compute, while the head — big as it is — is cheap matmul. That is the lesson the bridge makes concrete: at this scale the depth of the convolutional body is the lever, not the width of the classifier on top. It is why the head can stay MNIST’s, and why the next chapter spends its budget on more convolution — thirty-four layers of it — not a bigger head.

The two-point comparison above (64 vs 512) is worth drawing out in full, because the parametric renderer makes it cheap: cifar8-bn-grid holds the eight-convolution backbone fixed and renders the AdamW train step at any head width $d$ (the same $D_1$ that was hard-wired to 64 is now a parameter of the verified emitter), so we can sweep $d$ from 8 to 4096 and train each point on its own proof-rendered StableHLO. Split across the two gfx1100s, the whole curve is one short run:

[Figure: CIFAR-10 test accuracy (%) against dense-head width $d$ (both head layers; backbone fixed), on a base-2 logarithmic horizontal axis from 8 to 4096. Plotted points ($d$: accuracy): 8: 66.78, 16: 71.10, 32: 71.70, 64: 71.07, 128: 71.43, 256: 71.49, 512: 71.55, 1024: 72.17, 2048: 71.60, 4096: 71.73. The canonical head $d{=}64$ is circled.]

Dense-head width sweep for the 8-conv cifar8 net with per-channel BatchNorm, AdamW, 25 epochs, conv backbone held at $[16,16,32,32]$ (runs/cifar8bn_grid_results.tsv). Past a 16-wide head the curve is essentially flat: everything from $d{=}16$ to $d{=}4096$ lands within about one point ($71.1$ to $72.2\% $), so the $256\times $ wider head buys nothing, while its $17$M-parameter classifier just overfits (train loss $0.03$, test unmoved). The only real drop is at $d{=}8$ ($66.8\% $), where the $128{\to }8$ first layer throttles the 128-dim feature map the convolutions produce. The canonical $d{=}64$ (circled) sits squarely on the plateau: the head width genuinely does not matter here, which is the whole reason the net could borrow MNIST’s classifier untouched. (Absolute accuracy is a couple of points below the 40-epoch board above, because this is a 25-epoch single-optimizer sweep, but the shape is the point.)
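The parameter counts quoted in this section can be checked with a little arithmetic. The sketch below assumes the head is three dense layers (128 features to $d$, $d$ to $d$, $d$ to 10 classes), each with a bias; the function name is hypothetical and the bias assumption is ours, not something the text states. Under those assumptions the totals land on the rounded figures given for the 64-wide, 512-wide, and 4096-wide heads.

```python
def dense_head_params(d, features=128, classes=10):
    """Weights plus biases of a features -> d -> d -> classes dense head.

    Assumed layout: two hidden layers of width d, one output layer,
    a bias on every layer.
    """
    first = features * d + d    # feature map into the first hidden layer
    second = d * d + d          # hidden to hidden
    output = d * classes + classes
    return first + second + output

for width in (8, 64, 512, 4096):
    print(width, dense_head_params(width))
```

The $d \times d$ middle layer is what makes the count grow quadratically, which is why the widest head reaches tens of millions of parameters while the convolutional body stays fixed.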

Why the levers work

Both levers do the same underlying thing — they make each step’s gradient a more reliable guide to the next — by attacking different sources of noise. Each layer is tuned for the distribution of its inputs, but those inputs are the outputs of every layer below, which shift every step; so each layer chases a moving target, and the step that helps one layer can wreck the next. BN removes the moving part by pinning every layer’s input to mean-zero, unit-variance before the learnable $\gamma ,\beta $ get a say — Ioffe & Szegedy framed this as reducing “internal covariate shift,” while Santurkar et al. (2018) argued the sharper effect is a smoother loss landscape. Momentum attacks a different noise: by averaging successive gradients it cancels the mini-batch jitter and accumulates the consistent direction. Both make the per-step direction more trustworthy, and reliable progress compounds across epochs — precisely what the curve and the table measure. (On larger networks, and with true batch-statistic normalization, BN’s effect is stronger still — it opens up learning rates that diverge without it — but at this depth, what we can honestly measure and verify is the speed, so that is the claim we make here.) That is why every image architecture since 2015 bakes a normalization layer in by default, and why the optimizer is the first knob any practitioner reaches for: between them they decide not whether the net can learn this data, but how many passes it takes.
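Both mechanisms can be seen on toy numbers. The sketch below is an illustration under simplifying assumptions, not the chapter's experiment: `normalize` is plain mean and variance normalization over one small vector, and `momentum_average` feeds an artificial gradient stream (a constant direction of 1 plus alternating jitter of size 5) through the buffer $v \leftarrow \mu v + g$, rescaled by $1-\mu $ so it is comparable to a single gradient. Both function names are hypothetical.

```python
import math

def normalize(xs, eps=1e-5):
    """Pin a vector to (approximately) zero mean and unit variance."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def momentum_average(grads, mu=0.9):
    """Run v <- mu*v + g over a gradient stream; rescale by (1 - mu)."""
    v = 0.0
    for g in grads:
        v = mu * v + g
    return (1 - mu) * v

# The layer below drifts: same activations, rescaled and shifted.
acts = [1.0, 2.0, 3.0, 4.0]
drifted = [3.0 * a + 7.0 for a in acts]
print(normalize(acts))
print(normalize(drifted))   # nearly identical to the line above

# Consistent direction 1.0 buried under alternating jitter of +/-5.
noisy = [1.0 + (5.0 if t % 2 == 0 else -5.0) for t in range(200)]
print(momentum_average(noisy))  # stays close to 1.0
```

After normalization the drifted activations are indistinguishable from the originals up to the small effect of $\varepsilon $, so the layer above sees a fixed target. The momentum buffer, for its part, shrinks jitter of size 5 to roughly a quarter of a unit around the consistent direction.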

4.3 MLIR: BatchNorm

What is already proven. BatchNorm factors as $\mathrm{bnForward} = \mathrm{bnAffine} \circ \mathrm{bnNormalize}$, and its reverse-mode derivative is the three-term formula of Theorem 34: with $\hat{x} = (x-\mu )\, \mathrm{istd}$,

\[ dx = \frac{\mathrm{istd}\, \gamma }{n}\Bigl(n\, dy \; -\; \textstyle \sum _k dy_k \; -\; \hat{x}\, \textstyle \sum _k \hat{x}_k\, dy_k\Bigr). \]

bn_has_vjp proves it, composing bnNormalize_has_vjp (the rank-1 wringer) with bnAffine_has_vjp (the $\gamma \, dy$ half). The one subtlety, isolated in Theorem 32, is that the inverse-stddev term carries an $\mathrm{istd}^3$ that needs $\varepsilon {\gt} 0$ to stay differentiable. It is the single place in the book where the math reaches past chain-sum-product into real analysis.

The gap and how we close it. The three-term backward is not elementwise — the two $\sum _k$ reductions couple every coordinate to every other. The emitted backward graph is given a denotation in the proofs’ own vector type, and bn_back_bridge proves that denotation equal to bn_has_vjp’s backward — the emitted reduce/broadcast/elementwise graph is, by machine check, the three-term formula. Here is what the printer emits (the forward-statistic recompute — mean, variance, rsqrt, normalize — elided to its one comment line; $n=4$):

// forward stats recomputed: %mu, %istd, %xhat = (x-mu)*istd
%dxhat = stablehlo.multiply %gb, %dy : tensor<2x4xf32>   // dxhat = g*dy
%sdx_r = stablehlo.reduce(%dxhat init: %sc)
           applies stablehlo.add across dimensions = [1]
           : (tensor<2x4xf32>, tensor<f32>) -> tensor<2xf32>
%sdx = stablehlo.broadcast_in_dim %sdx_r, dims = [0]
           : (tensor<2xf32>) -> tensor<2x4xf32>          // sum dxhat
%xd = stablehlo.multiply %xhat, %dxhat : tensor<2x4xf32>
%sxdx_r = stablehlo.reduce(%xd init: %sc)
           applies stablehlo.add across dimensions = [1]
           : (tensor<2x4xf32>, tensor<f32>) -> tensor<2xf32>
%sxdx = stablehlo.broadcast_in_dim %sxdx_r, dims = [0]
           : (tensor<2xf32>) -> tensor<2x4xf32>          // sum xhat*dxhat
%t1 = stablehlo.multiply %dxhat, %nf : tensor<2x4xf32>   // N*dxhat
%i1 = stablehlo.subtract %t1, %sdx : tensor<2x4xf32>     //   - sum dxhat
%xs = stablehlo.multiply %xhat, %sxdx : tensor<2x4xf32>
%i2 = stablehlo.subtract %i1, %xs : tensor<2x4xf32>      //   - xhat*(sum)
%s = stablehlo.divide %istd, %nf : tensor<2x4xf32>       // istd/N
%dx = stablehlo.multiply %s, %i2 : tensor<2x4xf32>
return %dx : tensor<2x4xf32>

Read it against the formula: %dxhat is the affine backward $\gamma \, dy$; each reduce along dimensions = [1] followed by a broadcast_in_dim is one of the cross-coordinate sums ($\sum _k dy_k$ as %sdx, $\sum _k\hat{x}_k\, dy_k$ as %sxdx); %i1 assembles $n\, dy - \sum dy$ (the direct term minus the centering correction), %i2 subtracts the rank-1 normalization correction $\hat{x}\sum \hat{x}\, dy$, and %dx scales the whole bracket by $\mathrm{istd}/n$. The graph folds $\gamma $ into %dxhat up front, which is why that leading scale is $\mathrm{istd}/n$ and not $\mathrm{istd}\, \gamma /n$ — the same formula, $\gamma $ pulled inside the parenthesis. The bridge theorem is precisely the claim that this text computes bn_has_vjp’s backward.
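A numeric cross-check of the same reading, as a minimal sketch in plain Python rather than the verified pipeline: `bn_backward_folded` follows the emitted graph's order of operations on a single row (with $\gamma $ folded into `dxhat` first), `bn_backward_closed` evaluates the three-term formula with $\gamma $ outside the bracket, and `finite_difference` differentiates the forward pass directly. All function names are hypothetical, and $\gamma $ is taken to be a scalar so that both forms apply.

```python
import math

def bn_forward(x, gamma, beta, eps=1e-5):
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    istd = 1.0 / math.sqrt(var + eps)
    xhat = [(xi - mu) * istd for xi in x]
    y = [gamma * xh + beta for xh in xhat]
    return y, xhat, istd

def bn_backward_folded(x, gamma, dy, eps=1e-5):
    """Same op order as the emitted graph: gamma folded into dxhat up front."""
    n = len(x)
    _, xhat, istd = bn_forward(x, gamma, 0.0, eps)
    dxhat = [gamma * d for d in dy]                    # %dxhat
    sdx = sum(dxhat)                                   # %sdx
    sxdx = sum(xh * d for xh, d in zip(xhat, dxhat))   # %sxdx
    scale = istd / n                                   # %s
    return [scale * (n * d - sdx - xh * sxdx) for d, xh in zip(dxhat, xhat)]

def bn_backward_closed(x, gamma, dy, eps=1e-5):
    """Three-term formula with gamma kept outside the bracket."""
    n = len(x)
    _, xhat, istd = bn_forward(x, gamma, 0.0, eps)
    s1 = sum(dy)
    s2 = sum(xh * d for xh, d in zip(xhat, dy))
    return [istd * gamma / n * (n * d - s1 - xh * s2) for d, xh in zip(dy, xhat)]

def finite_difference(x, gamma, dy, h=1e-6):
    """Central difference of L(x) = sum_j dy_j * y_j(x)."""
    def loss(z):
        y, _, _ = bn_forward(z, gamma, 0.0)
        return sum(d * yi for d, yi in zip(dy, y))
    grads = []
    for i in range(len(x)):
        up = list(x); up[i] += h
        dn = list(x); dn[i] -= h
        grads.append((loss(up) - loss(dn)) / (2 * h))
    return grads

x = [0.5, -1.0, 2.0, 0.25]        # n = 4, as in the listing
dy = [1.0, -0.5, 0.25, 2.0]
print(bn_backward_folded(x, 1.5, dy))
print(bn_backward_closed(x, 1.5, dy))
print(finite_difference(x, 1.5, dy))
```

The three printed vectors agree to within finite-difference error. This is only a spot check at one input; the bridge theorem is the statement that holds for every input.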

Because LayerNorm is BatchNorm along a different axis, the very same emitted graph denotes the LayerNorm backward (layernorm_back_bridge is literally bn_back_bridge). That is the normalization sitting inside the residual, depthwise, and ConvNeXt blocks built on this foundation.

Caveats.

Needs $\varepsilon {\gt} 0$. The inverse-stddev term carries an $\mathrm{istd}^3$; given $\varepsilon {\gt} 0$ the bridge is unconditional (smooth everywhere, with no smooth-point exclusion, unlike ReLU and max-pool).
Representative scale ($n = 4$): the emitted graph shown above is printed at this size.
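The $\varepsilon {\gt} 0$ caveat is easy to see numerically. In the sketch below (plain Python, hypothetical function name) the variance is set to zero, as it would be for a constant input: the inverse standard deviation and its slope with respect to the variance, $-\tfrac {1}{2}\, \mathrm{istd}^3$, stay finite for any positive $\varepsilon $ but grow without bound as $\varepsilon $ shrinks, and at $\varepsilon = 0$ the expression is undefined.

```python
import math

def istd_and_slope(var, eps):
    """Return istd = (var + eps)^(-1/2) and d(istd)/d(var) = -istd^3 / 2."""
    istd = 1.0 / math.sqrt(var + eps)
    slope = -0.5 * istd ** 3
    return istd, slope

# Zero variance (constant input): finite only because eps > 0.
for eps in (1e-3, 1e-5, 1e-7):
    print(eps, istd_and_slope(0.0, eps))

# With eps = 0 the same expression divides by zero.
try:
    istd_and_slope(0.0, 0.0)
except ZeroDivisionError as err:
    print("eps = 0:", err)
```

The cubic growth of the slope is the $\mathrm{istd}^3$ term named in the caveat: each hundredfold reduction in $\varepsilon $ multiplies it by a thousand.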

The next chapter (§ 5) adds residual connections and applies the same mechanical approach: prove that the VJP of a skip connection is additive fan-in, compose with BN and conv, and the rest of the ResNet family falls out without introducing any new math.