9 Vision Transformer

Where Part 1 has been heading

Every previous chapter has introduced a new architectural primitive and proved its backward pass. Convolution, BatchNorm, residual, depthwise, SE, LayerNorm, GELU—each one slotted into the NetSpec type and earned its own has_vjp theorem. This chapter does the same for attention, but it also does something larger: it composes every layer the framework has proved so far into one machine-checked backward pass over a complete state-of-the-art image classifier.

The destination is Theorem 84 at the end of the chapter. It says, roughly: the full ViT backbone—patch embedding, twelve transformer blocks, final LayerNorm—is a $\mathsf{HasVJPMat}$. The proof composes everything this chapter introduces, plus most of what the previous chapters introduced, into a single tower of vjpMat_comp applications. Read the proof’s dependency list and you see a tour of Part 1: dense, residual, layer norm, GELU, softmax, matrix multiply, plus the eight or ten matrix-machinery lemmas this chapter sets up. None of them is new in its own right; what’s new is how high the composition rule can be pushed.

Attention in one equation

A transformer encoder block does two things to its input sequence $X \in \mathbb {R}^{N \times d}$ (think of $N$ image patches, each represented as a $d$-vector):

Multi-head self-attention (MHSA): each output row is a weighted average of all input rows, where the weights are computed from the input itself.
Per-token MLP: a small two-layer MLP applied independently to every row.

Both halves are wrapped in a LayerNorm + residual sandwich (exactly the same template as ConvNeXt’s blocks from Chapter 8; the layout transferred to convnets in 2022 actually came from transformers in 2017).

The interesting half is the first one. Scaled dot-product attention computes, for queries $Q$, keys $K$, and values $V$ (all linear projections of $X$):

\[ \mathrm{Attention}(Q, K, V) \; =\; \mathrm{softmax}\! \left(\frac{Q K^\top }{\sqrt{d_h}}\right) V. \]

In words: every row of $Q$ is compared to every row of $K$ via inner product (the $QK^\top $ matrix is $N \times N$, one entry per query-key pair). Each row of that $N \times N$ matrix is passed through softmax to get a probability distribution over the keys (which positions to attend to). Those probabilities then weight the rows of $V$ to produce the output. The $1 / \sqrt{d_h}$ scaling keeps the softmax from saturating at large $d_h$. Multi-head attention just runs several copies of this in parallel on lower-dimensional projections, concatenating the outputs.

Each token—meaning each row of $X$—ends up looking at every other token. That’s the qualitative break from CNNs: convolutions look only at a local neighborhood, attention looks everywhere. The receptive field is global from the first layer.

Why this chapter has thirty theorems

The math of attention is matrix algebra. Chapter 1’s foundation rules—chain, sum, product, identity—were stated and proved for vector functions. Before we can prove backward passes through $Q K^\top $ and the row-wise softmax, we need the matrix-level lifts of those rules: chain rule on matrices, additive fan-in on matrices, identity on matrices, plus matrix-specific moves like transpose, matmul-with-one-factor-fixed, and scalar scale. That’s the first section of theorems below (Matrix-level machinery, $\sim $15 theorems). Every one of them is a mechanical lift of a vector-level theorem from Chapter 1, proved via the row-projection $\mathrm{ContinuousLinearMap}$ and the existing chain rule. Pay the cost once, reuse forever.

The second section (Attention proofs, $\sim $15 theorems) applies the matrix machinery to the actual attention math. Row-wise softmax—a per-row softmax applied to the $N \times N$ score matrix—has the well-known closed-form Jacobian $p_i(\delta _{ij} - p_i)$; we prove the row-wise lifting that says this holds across matrices. Scaled dot-product attention’s backward decomposes into three matmul-backward applications (one for each of $Q$, $K$, $V$) plus the row-wise softmax backward; we prove each. Then multi-head, then the attention sublayer (with residual), then the MLP sublayer (with residual), then a single transformer block, then a $k$-deep transformer tower, then the full ViT body. Each step is one or two compositions of the previous step. The chain that started at vjp_comp in Chapter 1 reaches all the way to a 5.5M-parameter image classifier here.

What’s actually proved

That’s the framework-side payoff of this chapter: every architectural piece used by any Part-1 network has a machine-checked backward. Once we land Theorem 84, the sentence “the gradient computed by this trainer is the mathematically correct one” is no longer a hope or a folklore result—it is a theorem the Lean kernel verifies on every build.

The empirical-side payoff comes after the theorems, in the Example section: ViT-Tiny trained on Imagenette underperforms the ConvNets, fairly dramatically, because 9,469 training images is too small for a transformer’s lack of inductive bias. That’s informative on its own terms (the data-hungry-transformer story made concrete) and sets up the augmentation ablation that closes the chapter.

9.1 Matrix-level machinery

Attention is fundamentally matrix algebra: queries times keys transposed ($QK^T$), softmax over rows of the resulting matrix, matmul against values ($\cdot V$). Before we can prove backward passes through these operations, we need the matrix-level extensions of Ch 1’s foundation rules. Each theorem below is the matrix lift of a vector-level theorem from Ch 1, proved by composing the vector version with the row-projection $\mathrm{ContinuousLinearMap}$. The Lean proofs all live in Proofs/Tensor.lean alongside their vector counterparts; we collect them here because Ch 9 is the first and only chapter to use them.

Definition 48 Matrix VJP record

✓

The rank-2 analogue of Definition 9. Define $\operatorname {pdivMat}f\, A\, (i,j)\, (k,l)$ as $\operatorname {pdiv}$ of the row-major flattened map $v \mapsto \mathrm{flatten}\bigl(f(\mathrm{unflatten}\, v)\bigr)$ with both index pairs encoded through the $(i, j) \leftrightarrow \text{flat}$ bijection. Then $\mathsf{HasVJPMat}\, f$ bundles a backward function $B$ with: for all $A$, $dY$, $(i, j)$,

\[ B(A, dY)_{ij} = \sum _{k, l} \operatorname {pdivMat}f\, A\, (i,j)\, (k,l) \cdot dY_{kl}. \]

Theorem 49 Row-independence for matrices

✓

assume:

$g : \mathbb {R}^{n} \to \mathbb {R}^{p}$ is differentiable everywhere [h_g_diff]

prove: applying $g$ to each row of a matrix has block-diagonal Jacobian: $\operatorname {pdivMat}(\text{rowwise } g)\, A\, (i,j)\, (k,l) = [i = k] \cdot \operatorname {pdiv}g\, (A_i)\, j\, l$.

Proof ▶

Coordinate factoring: the $(k, l)$ output coordinate of the flattened rowwise map is $(y \mapsto g(y)_l) \circ \mathrm{rowProj}_k$, where $\mathrm{rowProj}_k$ is the CLM extracting row $k$ of the flat vector.
proof: Unfold flatten/unflatten; $\mathrm{rowProj}_k$ is a reindex, hence a CLM.
Every output coordinate — hence the whole flat map — is differentiable, so $\operatorname {fderiv}$ does not fall back to its junk default.
proof: Assumption 1 composed with the CLM of step 1 (differentiableAt_pi).
q.e.d.
proof: Chain rule through the CLM: the derivative of $(y \mapsto g(y)_l) \circ \mathrm{rowProj}_k$ at $\mathbf{e}_{(i,j)}$ sees $\mathrm{rowProj}_k(\mathbf{e}_{(i,j)}) = [i = k]\, \mathbf{e}_j$, so by Definition 1 the entry is $[i = k] \cdot \operatorname {pdiv}g\, (A_i)\, j\, l$.

Theorem 50 Matrix-level chain rule

✓

assume:

$B_F$, $B_G$ are correct backward functions ($\mathsf{HasVJPMat}\, F$, $\mathsf{HasVJPMat}\, G$) [hF, hG]
the flattened forms of $F$ and $G$ are differentiable everywhere [hF_diff, hG_diff]

prove: $\mathsf{HasVJPMat}\, (G \circ F)$.

Proof ▶

Sketch: the VJP chain rule (Theorem 10) transcribed to rank-2 indices; every step is the same, with double sums where the rank-1 proof had single sums.

Define $B(A, dY) := B_F\bigl(A,\; B_G(F(A),\, dY)\bigr)$.
suffices: $B$ matches the $\operatorname {pdivMat}$ contraction.
proof: Definition 48.
Chain rule for $\operatorname {pdivMat}$: $\operatorname {pdivMat}(G \circ F)\, A\, (i,j)\, (k,l) = \sum _{p,q} \operatorname {pdivMat}F\, A\, (i,j)\, (p,q) \cdot \operatorname {pdivMat}G\, (F A)\, (p,q)\, (k,l)$.
proof: The flattened composite is the composite of the flattened maps (unflatten $\circ $ flatten cancels), so the rank-1 chain rule (Theorem 2) applies at the flat level, applicable by assumption 2; re-index the flat middle sum to $(p, q)$ pairs.
q.e.d.
proof: Expand the two correct fields in 1, substitute 3, and swap the two double sums (pack indices into pairs, Finset.sum_comm, unpack); with 2 this is the goal — step for step the proof of Theorem 10.

Theorem 51 Matrix-level additive fan-in

✓

Rank-2 transcription of Theorem 11: given $\mathsf{HasVJPMat}$ for $F$ and $G$ (with flat differentiability), $\mathsf{HasVJPMat}\, (F + G)$ with backward $B_F(A, dY) + B_G(A, dY)$.

Proof ▶

suffices: the summed backward matches the $\operatorname {pdivMat}$ contraction.
proof: Definition 48.
q.e.d.
proof: Expand both correct fields, merge the double sums (Finset.sum_add_distrib twice); the sum rule for $\operatorname {pdivMat}$ — Theorem 3 through the flatten bijection — rewrites $\operatorname {pdivMat}F + \operatorname {pdivMat}G = \operatorname {pdivMat}(F + G)$.

Theorem 52 Matrix-level identity

✓

$\mathsf{HasVJPMat}\, (\mathrm{id})$, backward $B(A, dY) = dY$.

Proof ▶

$\operatorname {pdivMat}(\mathrm{id})\, A\, (i,j)\, (k,l) = [i = k \wedge j = l]$.
proof: Flatten $\circ $ unflatten is the identity on the flat vector, so the identity Jacobian (Theorem 5) applies; the flat-index equality decodes to $i = k \wedge j = l$ by injectivity of the pairing.
q.e.d.
proof: Contract 1 with $dY$: the two-dimensional Kronecker sum collapses (Finset.sum_eq_single twice) to $dY_{ij}$; Definition 48 closes.

Theorem 53 Scalar-scale Jacobian

✓

Phase 6 derivation. prove: $\operatorname {pdivMat}(M \mapsto s \cdot M)\, A\, (i,j)\, (k,l) = [i = k \wedge j = l] \cdot s$.

Proof ▶

The flattened scalar-scale map reduces to $v \mapsto s \cdot v$.
proof: Flatten/unflatten round-trip, pointwise.
$\operatorname {pdiv}(v \mapsto s \cdot v) = s \cdot \delta $ at the flat indices.
proof: Factor as $(\text{const } s) \cdot (\text{identity})$: product rule (Theorem 4), constant rule (Theorem 6), identity Jacobian (Theorem 5).
q.e.d.
proof: The flat-index equality decodes to $i = k \wedge j = l$ by injectivity of the pairing.

Theorem 54 Transpose Jacobian

✓

prove: $\operatorname {pdivMat}(\mathrm{transpose})\, A\, (i,j)\, (k,l) = [j = k \wedge i = l]$.

Proof ▶

The flattened transpose is a pure gather: at output index $\mathrm{idx}$, it reads $v$ at the index with components swapped.
proof: Unfold transpose/flatten/unflatten.
q.e.d.
proof: Reindex Jacobian (Theorem 7) with the swap map $\sigma $: the indicator condition $\mathrm{flat}(i,j) = \mathrm{flat}(l,k)$ decodes to $j = k \wedge i = l$ by injectivity of the pairing.

Theorem 55 Matmul Jacobian, left factor fixed

✓

Phase 6: derived, not axiomatized. prove: $\operatorname {pdivMat}(B' \mapsto C \cdot B')\, B\, (i,j)\, (k,l) = [l = j] \cdot C_{ki}$.

Proof ▶

The flattened map at output index $\mathrm{idx}$ is $v \mapsto \sum _s C_{k(\mathrm{idx}), s} \cdot v_{\mathrm{flat}(s,\, l(\mathrm{idx}))}$ — a finite sum of (constant) $\times $ (reindex) terms.
proof: Unfold $\mathrm{Mat.mul}$/flatten/unflatten.
Per-index Jacobian: finite-sum rule (Theorem 8) distributes over $s$; product rule with the $C$-factor constant (Theorems 4, 6) and reindex Jacobian (Theorem 7) leave $\sum _s C_{k, s} \cdot [\mathrm{flat}(i,j) = \mathrm{flat}(s, l)]$.
q.e.d.
proof: The indicator forces $s = i$ and $l = j$ (injectivity of the pairing), collapsing the sum to $[l = j] \cdot C_{ki}$.

Theorem 56 Matmul Jacobian, right factor fixed

✓

prove: $\operatorname {pdivMat}(A' \mapsto A' \cdot D)\, A\, (i,j)\, (k,l) = [i = k] \cdot D_{jl}$.

Proof ▶

Mirror image of Theorem 55: the flattened map at output $\mathrm{idx}$ is $v \mapsto \sum _s v_{\mathrm{flat}(k(\mathrm{idx}),\, s)} \cdot D_{s,\, l(\mathrm{idx})}$, the same finite-sum-of-reindex-times-constant shape with the variable factor on the left; the same three steps (finite-sum, product + constant + reindex rules, indicator collapse at $s = j$, $k = i$) give $[i = k] \cdot D_{jl}$.

Theorem 57 Matmul VJP, left factor fixed

✓

$\mathsf{HasVJPMat}\, (B' \mapsto C \cdot B')$, backward $dB = C^{T} \cdot dY$.

Proof ▶

Define $B(B', dY)_{ij} := \sum _k C_{ki}\, dY_{kj}$ (that is, $C^T \cdot dY$).
suffices: $B$ matches the $\operatorname {pdivMat}$ contraction.
proof: Definition 48.
q.e.d.
proof: Substitute the Jacobian (Theorem 55): $\sum _{k,l} [l = j]\, C_{ki} \cdot dY_{kl}$ collapses at $l = j$ (Finset.sum_ite_eq’) to $\sum _k C_{ki}\, dY_{kj} = B$.

Theorem 58 Matmul VJP, right factor fixed

✓

$\mathsf{HasVJPMat}\, (A' \mapsto A' \cdot D)$, backward $dA = dY \cdot D^{T}$.

Proof ▶

Define $B(A', dY)_{ij} := \sum _l dY_{il}\, D_{jl}$ (that is, $dY \cdot D^T$).
suffices: $B$ matches the $\operatorname {pdivMat}$ contraction.
proof: Definition 48.
q.e.d.
proof: Substitute the Jacobian (Theorem 56): $\sum _{k,l} [i = k]\, D_{jl} \cdot dY_{kl}$ collapses at $k = i$ (Finset.sum_ite_eq) to $\sum _l D_{jl}\, dY_{il} = B$.

Theorem 59 Scalar-scale VJP

✓

$\mathsf{HasVJPMat}\, (M \mapsto s \cdot M)$, backward $dA = s \cdot dY$.

Proof ▶

By Definition 48 it suffices to contract the Jacobian (Theorem 53) with $dY$: $\sum _{k,l} [i = k \wedge j = l]\, s \cdot dY_{kl}$ collapses at $(k, l) = (i, j)$ to $s \cdot dY_{ij}$. q.e.d.

Theorem 60 Transpose VJP

✓

$\mathsf{HasVJPMat}\, (\mathrm{transpose})$, backward $dA = dY^{T}$.

Proof ▶

By Definition 48 it suffices to contract the Jacobian (Theorem 54) with $dY$: $\sum _{k,l} [j = k \wedge i = l] \cdot dY_{kl}$ collapses at $(k, l) = (j, i)$ to $dY_{ji}$. q.e.d.

Theorem 61 Row-wise lifting of any $\mathsf{HasVJP}$

✓

Phase 8 generic helper. assume:

$B_g$ is a correct backward for $g : \mathbb {R}^{n} \to \mathbb {R}^{p}$ ($\mathsf{HasVJP}\, g$) [hg]
$g$ is differentiable everywhere [hg_diff]

prove: $\mathsf{HasVJPMat}$ of the rowwise map on $\mathbb {R}^{m \times n} \to \mathbb {R}^{m \times p}$, with backward applying $B_g$ to each row: $B(A, dY)_r = B_g(A_r, dY_r)$.

Proof ▶

suffices: $B$ matches the $\operatorname {pdivMat}$ contraction.
proof: Definition 48.
The Jacobian is block-diagonal: $\operatorname {pdivMat}= [i = k] \cdot \operatorname {pdiv}g\, (A_i)\, j\, l$.
proof: Row-independence (Theorem 49), applicable by assumption 2.
q.e.d.
proof: Contract 2 with $dY$: the row sum collapses at $k = i$ (Finset.sum_ite_eq); what remains is precisely $g$’s own correctness equation at row $i$ (assumption 1).

Theorem 62 3D chain rule

✓

prove: for flattened-differentiable $f$, $g$: $\operatorname {pdiv}_3(g \circ f)$ is the middle-index contraction $\sum _{c_j, h_j, w_j} \operatorname {pdiv}_3 f \cdot \operatorname {pdiv}_3 g$.

Proof ▶

$\operatorname {pdiv}_3$ is $\operatorname {pdiv}$ of the flattened map (Definition 24), and the flattened composite is the composite of the flattened maps.
proof: Unflatten $\circ $ flatten cancels between the stages.
q.e.d.
proof: The rank-1 chain rule (Theorem 2) at the flat level, then re-index the flat middle sum to $(c_j, h_j, w_j)$ triples through the index bijection.

Theorem 63 3D VJP chain rule

✓

Rank-3 transcription of Theorem 10: composite backward $B(x, dy) = B_f\bigl(x,\, B_g(f\, x,\, dy)\bigr)$.

Proof ▶

suffices: the composite backward matches the $\operatorname {pdiv}_3$ triple-sum contraction.
proof: Definition 24.
q.e.d.
proof: Expand both correct fields, substitute the 3D chain rule (Theorem 62), then swap the two triple sums by packing each into a product index (Finset.sum_product, Finset.sum_comm) — the rank-3 rendition of Theorem 10 steps 3–6.

Theorem 64 3D additive fan-in

✓

Rank-3 transcription of Theorem 11: backward is the sum of the two backwards.

Proof ▶

By Definition 24 it suffices to match the triple-sum contraction: expand both correct fields, merge the three nested sums (Finset.sum_add_distrib three times), and apply the sum rule (Theorem 3, through the flatten bijection) to rewrite $\operatorname {pdiv}_3 f + \operatorname {pdiv}_3 g = \operatorname {pdiv}_3(f + g)$. q.e.d.

9.2 Attention proofs

Theorem 65 Softmax Jacobian

✓

Writing $p := \mathrm{softmax}(z)$: prove: $\operatorname {pdiv}(\mathrm{softmax})\, z\, i\, j = p_j\, (\delta _{ij} - p_i)$.

Proof ▶

$\operatorname {pdiv}(\mathrm{softmax})\, z\, i\, j = \operatorname {fderiv}_{\mathbb {R}}\, \bigl(z' \mapsto e^{z'_j} \cdot (\textstyle \sum _k e^{z'_k})^{-1}\bigr)\, z\, (\mathbf{e}_i)$.
proof: Definition 1; extract the $j$-th coordinate (fderiv_apply), differentiable because the denominator $S := \sum _k e^{z_k}$ is a sum of positive exponentials, hence nonzero.
Differentiate the product: numerator via HasFDerivAt.exp, denominator via (hasDerivAt_inv).comp_hasFDerivAt (licensed by $S {\gt} 0$).
Evaluate at $\mathbf{e}_i$: the numerator’s derivative contributes $e^{z_j}\, \delta _{ji}$, and the denominator’s contributes $-e^{z_j}\, S^{-2} \cdot e^{z_i}$, the inner sum collapsing by $\sum _k e^{z_k}\, \delta _{ki} = e^{z_i}$.
q.e.d.
proof: Combine 2–3 and rewrite in terms of $p = e^{z}/S$: $S^{-1} e^{z_j} \delta _{ij} - e^{z_j} S^{-2} e^{z_i} = p_j(\delta _{ij} - p_i)$ (field_simp, ring).

Theorem 66 Standalone softmax VJP

✓

$\mathsf{HasVJP}\, (\mathrm{softmax})$, with the closed-form $O(c)$ backward

\[ B(z, dy)_i = p_i\, \bigl(dy_i - \langle p, dy \rangle \bigr), \]

where $\langle p, dy \rangle = \sum _j p_j\, dy_j$ is one precomputed scalar — the rank-1 structure of the Jacobian is what turns the naive $O(c^2)$ contraction into a reduction plus a broadcast, the same optimization pattern as BN and max-pool.

Proof ▶

suffices: $B(z, dy)_i = \sum _j \operatorname {pdiv}(\mathrm{softmax})\, z\, i\, j \cdot dy_j$.
proof: Definition 9.
$\sum _j p_j (\delta _{ij} - p_i)\, dy_j = p_i\, dy_i - p_i \sum _j p_j\, dy_j$.
proof: Substitute the softmax Jacobian (Theorem 65); split the sum, collapse the $\delta $-term (Finset.sum_ite), factor $p_i$ out of the second (Finset.mul_sum).
q.e.d.
proof: The right side of 2 is $p_i(dy_i - \langle p, dy\rangle ) = B(z, dy)_i$; with 1, done.

Theorem 67 Row-wise softmax VJP on a matrix

✓

$\mathsf{HasVJPMat}\, (\mathrm{rowSoftmax})$: rows are independent, so the backward applies the softmax backward per row.

Proof ▶

The rowwise-lifting argument (Theorem 61) instantiated at $g = \mathrm{softmax}$:

The Jacobian is block-diagonal with the standalone softmax Jacobian in each block.
proof: Row-independence (Theorem 49); softmax is differentiable (positive denominator).
q.e.d.
proof: Contract with $dY$, collapse the row sum at $k = i$; what remains is Theorem 66’s correctness equation at row $i$; Definition 48 closes.

Theorem 68 SDPA backward wrt Q

✓

prove: for fixed $K, V$: $\mathrm{sdpa\_ back\_ Q}$ matches the $\operatorname {pdivMat}$ contraction of $Q' \mapsto \mathrm{sdpa}(Q', K, V)$.

Proof ▶

The forward is a four-link chain:

\[ Q' \; \mapsto \; Q' K^T \; \mapsto \; \tfrac {1}{\sqrt{d}} \cdot {\mathord {@}} \; \mapsto \; \mathrm{rowSoftmax}\, {\mathord {@}} \; \mapsto \; {\mathord {@}} \cdot V. \]

proof: Definitional (sdpa_Q_chain_eq).
Each link has a proved $\mathsf{HasVJPMat}$: matmul-right-const at $K^T$ (Theorem 58), scalar-scale (Theorem 59), row-wise softmax (Theorem 67), matmul-right-const at $V$; glue with the matrix chain rule (Theorem 50) three times.
proof: The flat-differentiability side conditions: the first two links are linear (fun_prop) and rowSoftmax is smooth (Theorem 78); compositions inherit.
q.e.d.
proof: The chain’s composed backward literally computes $\mathrm{sdpa\_ back\_ Q}$’s nested formula (pure unfolding), so the chain’s correct field is the goal.

Theorem 69 SDPA backward wrt K

✓

prove: for fixed $Q, V$: $\mathrm{sdpa\_ back\_ K}$ matches the $\operatorname {pdivMat}$ contraction of $K' \mapsto \mathrm{sdpa}(Q, K', V)$.

Proof ▶

Same shape as Theorem 68, with a leading transpose link:

The forward chain is $K' \mapsto K'^T \mapsto Q \cdot {\mathord {@}} \mapsto \tfrac {1}{\sqrt{d}} \cdot {\mathord {@}} \mapsto \mathrm{rowSoftmax}\, {\mathord {@}} \mapsto {\mathord {@}} \cdot V$, the second link now matmul-left-const (Theorems 60, 57).
Glue the five links with Theorem 50 (four applications), differentiabilities as before.
q.e.d.
proof: The chain’s backward computes $\sum _k Q_{kj}\, d\mathrm{Scores}_{ki}$ while $\mathrm{sdpa\_ back\_ K} = (d\mathrm{Scores})^T Q$ expands to $\sum _k d\mathrm{Scores}_{ki}\, Q_{kj}$ — equal by mul_comm at the summand.

Theorem 70 SDPA backward wrt V

✓

prove: for fixed $Q, K$: $\mathrm{sdpa\_ back\_ V}$ matches the $\operatorname {pdivMat}$ contraction of $V' \mapsto \mathrm{sdpa}(Q, K, V')$.

Proof ▶

$V'$ enters only through the final matmul: $\mathrm{sdpa}(Q, K, V') = W \cdot V'$ with $W := \mathrm{sdpa\_ weights}(Q, K)$ fixed.
proof: Definitional (sdpa_eq_mul_weights).
q.e.d.
proof: Theorem 57’s backward is $W^T \cdot d\mathrm{Out}$, which unfolds to exactly $\mathrm{sdpa\_ back\_ V}$.

Theorem 71 Multi-head SDPA VJP

✓

Phase 3 (Apr 2026), de-axiomatized Phase 8. prove: $\mathsf{HasVJPMat}\, (\mathrm{mhsa\_ layer})$.

Proof ▶

Sketch: column-stacking — each head touches only its own column slab, so the single-head SDPA backward lifts across the head axis the same way row-independence lifts vector VJPs across rows.

$\mathrm{mhsa\_ layer}$ factors as $(\text{per-token dense } W_o) \circ \mathrm{colSlabApply}(\text{single-head SDPA}) \circ (\text{per-token fused QKV dense})$.
proof: Definitional (mhsa_layer_eq_compose).
The middle factor’s Jacobian is block-diagonal across heads: pdivMat_colIndep, the column-axis analogue of row-independence (Theorem 49), lifts the proved single-head backward (Theorems 68–70) over the head axis (colSlabwise_has_vjp_mat).
Glue the three factors with the matrix chain rule (Theorem 50) twice; the dense factors are per-token lifts (Theorem 76).
q.e.d.
proof: The backward field is taken from the composed witness directly (kernel-cheap projection); the composition-equality transport of step 1 lives only in the correct field — a Prop the kernel never reduces.

Theorem 72 MHSA layer smoothness

✓

$\mathsf{Differentiable}$ sibling of Theorem 71.

Proof ▶

Factor as in Theorem 71 step 1; each factor’s flattened form is differentiable (two per-token dense lifts, and $\mathrm{colSlabApply}$ of the single-head function, whose smoothness comes from the same chain of exp/inv positivity as Theorem 78); compositions inherit differentiability. q.e.d.

Theorem 73 Patch-embedding VJP

✓

Phase 6b (Apr 2026): de-opaqued in 6a — the concrete def unfolds to conv-with-stride + CLS prepend + positional embed. prove: $\mathsf{HasVJP}\, (\mathrm{patchEmbed\_ flat})$, with backward the deconvolution formula: a sum over patches with reconstructed kernel offsets, the CLS row contributing nothing to the image gradient.

Proof ▶

Sketch: the conv2d input-VJP recipe (Theorem 25) reapplied to the strided patch convolution, with one new wrinkle at the collapse: the CLS row.

Define $B(\mathrm{img}, dy)$ at input position $(c, h, w)$ as $\sum _{p, k_h, k_w} [\text{offset match}] \sum _d W^{\mathrm{conv}}_{d, c, k_h, k_w} \cdot dy_{\mathrm{flat}(p + 1,\, d)}$, where patch $p$ decodes to grid position $(\lfloor p / W' \rfloor , p \bmod W')$.
suffices: $B$ matches the $\operatorname {pdiv}$ contraction.
proof: Definition 9.
The forward decomposes as $(\text{constant: bias} + \text{CLS} + \text{pos-embed}) + (\text{img-linear part})$, where the pad guard absorbs the $n = 0$ (CLS-row) test so the linear part is uniform in $\mathrm{img}$.
proof: Unfold; definitional.
Per-index Jacobian by the conv2d recipe: sum + constant rules drop the constants (Theorems 3, 6); finite-sum rule (Theorem 8) $\times 3$; product rule (Theorem 4) with constant $W^{\mathrm{conv}}$ factor; guarded projection = reindex-or-zero (Theorems 7, 6).
Contract with $dy$ and collapse as in Theorem 25 step 5, with the new wrinkle: split the output-row sum over $\mathrm{Fin}(N{+}1)$ into the CLS row (contributes $0$) plus $\sum _{p : \mathrm{Fin}\, N}$ at rows $n = p + 1$ (Fin.sum_univ_succ).
q.e.d.
proof: What remains is $B$; with 2, done.

Theorem 74 Patch-embedding smoothness

✓

$\mathsf{Differentiable}$ sibling of Theorem 73.

Proof ▶

The decomposition of Theorem 73 step 3 exhibits the flattened forward as constant + linear-in-img (each summand a constant times a guarded projection); both pieces are differentiable and the sum inherits it. q.e.d.

Theorem 75 Per-token LayerNorm lifted to a matrix

✓

Proof ▶

Instantiate the rowwise lift (Theorem 61) at $g = \mathrm{layerNormForward}$, with Theorem 46 as the $\mathsf{HasVJP}$ witness and its differentiability sibling ($\varepsilon {\gt} 0$) as the smoothness witness. The backward is block-diagonal: each token’s gradient is computed independently by the 1D three-term formula. q.e.d.

Theorem 76 Per-token dense lifted to a matrix

✓

Proof ▶

Instantiate the rowwise lift (Theorem 61) at $g = \mathrm{dense}(W, b)$, with Theorem 18 as the $\mathsf{HasVJP}$ witness; dense is affine, hence everywhere differentiable. This is $Q = XW + b$ row-by-row with shared weights. q.e.d.

Theorem 77 Per-token GELU lifted to a matrix

✓

Proof ▶

Instantiate the rowwise lift (Theorem 61) at $g = \mathrm{gelu}$, with Theorem 47 as the $\mathsf{HasVJP}$ witness and the tanh-chain smoothness as the differentiability witness. Elementwise activation: the Jacobian is diagonal both across rows and within a row. q.e.d.

Theorem 78 Row-wise softmax smoothness

✓

prove: the flattened $\mathrm{rowSoftmax}$ is $\mathsf{Differentiable}$.

Proof ▶

case $n = 0$: the codomain is $0$-dimensional; the map is constant.
proof: Trivial.
case $n \ge 1$: each output coordinate is $v \mapsto e^{v_{(r, c)}} \cdot \bigl(\sum _j e^{v_{(r, j)}}\bigr)^{-1}$. The denominator is a positive finite sum (Real.exp_pos, Finset.sum_pos over the nonempty index set), so Differentiable.inv applies and the product of exponential chains closes.
q.e.d.
proof: Differentiability per coordinate gives differentiability of the Pi-valued map (differentiable_pi).

Theorem 79 Transformer MLP sublayer VJP

✓

$\mathsf{HasVJPMat}$ of $\mathrm{dense}_2 \circ \mathrm{gelu} \circ \mathrm{dense}_1$, all per-token.

Proof ▶

Inner composition $\mathrm{gelu} \circ \mathrm{dense}_1$: matrix chain rule (Theorem 50) on the two per-token lifts (Theorems 76, 77); both flat forms are differentiable (dense affine, GELU via the tanh chain), and the composition inherits it.
q.e.d.
proof: One more chain-rule application against the outer $\mathrm{dense}_2$ lift.

Theorem 80 Transformer attention sublayer with residual VJP

✓

assume: $\varepsilon {\gt} 0$ [h$\varepsilon $]
prove: $\mathsf{HasVJPMat}$ of $X \mapsto X + \mathrm{MHSA}(\mathrm{LN}_1(X))$.

Proof ▶

Inner arm $\mathrm{MHSA} \circ \mathrm{LN}_1$: matrix chain rule (Theorem 50) on Theorems 75 and 71, differentiabilities from the $\varepsilon {\gt} 0$ LN smoothness and Theorem 72.
q.e.d.
proof: Matrix additive fan-in (Theorem 51) of the identity arm (Theorem 52) and the inner arm — the same residual pattern as Theorem 37, one rank up.

Theorem 81 Transformer MLP sublayer with residual VJP

✓

assume: $\varepsilon {\gt} 0$ [h$\varepsilon $]
prove: $\mathsf{HasVJPMat}$ of $h \mapsto h + \mathrm{MLP}(\mathrm{LN}_2(h))$.

Proof ▶

Identical structure to Theorem 80: chain $\mathrm{MLP} \circ \mathrm{LN}_2$ (Theorems 79, 75 via Theorem 50), then $\operatorname {biPathMat}$ with the identity arm (Theorems 51, 52). q.e.d.

Theorem 82 Transformer block VJP

✓

prove: $\mathsf{HasVJPMat}$ of one pre-norm encoder block $z = x + \mathrm{MHSA}(\mathrm{LN}_1 x)$, $y = z + \mathrm{MLP}(\mathrm{LN}_2 z)$.

Proof ▶

One matrix chain rule (Theorem 50) glueing the attention sublayer (Theorem 80) to the MLP sublayer (Theorem 81), with each sublayer’s flat differentiability discharged from its arms (both are $\mathrm{id} + \text{smooth}$). q.e.d.

Theorem 83 Transformer tower, any depth

✓

prove: $\mathsf{HasVJPMat}$ of the $k$-block tower, for every $k$ — ViT-Tiny/Base ($k = 12$) and Large ($k = 24$) are instances.

Proof ▶

Induction on $k$:

case $k = 0$: the tower is the identity.
proof: Theorem 52.
case $k + 1$: the tower is $\mathrm{block} \circ \mathrm{tower}_k$.
proof: Matrix chain rule (Theorem 50) on the induction hypothesis and one block (Theorem 82), with the tower’s flat differentiability carried through the same induction.
q.e.d.
proof: By induction, 1 and 2 cover every depth.

Theorem 84 ViT body: the grand finale

✓

prove: the full ViT transformer backbone $\mathrm{finalLN} \circ \mathrm{transformerTower}$ is one $\mathsf{HasVJPMat}$.

Proof ▶

One final matrix chain rule (Theorem 50) glueing the $k$-block tower (Theorem 83) to the final per-token LayerNorm (Theorem 75). Every step of the chain that began with vjp_comp in Chapter 1 is a proved theorem; this is its last link. q.e.d.

9.3 Example: ViT-Tiny on Imagenette

This is the capstone. Every layer proved in the preceding chapters — dense, ReLU, softmax cross-entropy, convolution, maxPool, batch norm, residual, depthwise conv, squeeze-and-excitation, layer norm, GELU, and the attention mechanism itself — composes into this one architecture. If the NetSpec below compiles and the VJP of its body is proved (§ 84), then every layer in modern deep learning you’ve seen in this book has its backward pass machine-checked.

ViT-Tiny (Dosovitskiy et al. 2020) was the smallest Vision Transformer variant in the original paper. It’s also the least data-hungry — the big ViT variants start beating ConvNets at ImageNet-21K scale and above. On a 9469-image Imagenette training set, ViT-Tiny underperforms the ResNet-34 / MobileNet V2 / EfficientNet-B0 from earlier chapters. That’s part of the teaching point: transformers are not a universal improvement, they are a different scaling regime.

The architecture

ViT-Tiny is the shortest spec in Part 1: .patchEmbed chops the image into $16 \times 16$ patches and projects each to a 192-dim token (196 of them, plus a prepended CLS token), .transformerEncoder runs twelve pre-norm blocks, and a final LayerNorm feeds the CLS token to a dense head. The inset shows one of the twelve transformer blocks — two residual sub-blocks, attention then MLP, each pre-normed.

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.56cm, minimum width=6.7cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, norm/.style = {col, draw=teal!60!black, fill=teal!10}, blk/.style = {col, draw=blue!45!violet, fill=violet!8}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, stage/.style = {font=\sffamily\scriptsize\itshape, gray!55!black, anchor=west}, ] \node[io] (input){Input \;\; $224\times224\times3$}; \node[norm, below=0.16cm of input] (pe) {\textbf{PatchEmbed} $16\times16$ patches $\to 196\times192$}; \node[norm, below=0.16cm of pe] (cls) {prepend CLS token $+$ pos embed \;\; $197\times192$}; \node[blk, below=0.16cm of cls] (enc) {$12\times$ \textbf{Transformer encoder}\\dim 192, 3 heads, MLP 768}; \node[norm, below=0.16cm of enc] (ln) {final LayerNorm}; \node[head, below=0.16cm of ln] (d) {\textbf{Dense} $192\to10$ \;(identity), on CLS}; \node[logits, below=0.16cm of d] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/pe,pe/cls,cls/enc,enc/ln,ln/d,d/out}\draw[arr](\a)--(\b); \node[stage] at ($(enc.east)+(0.30,0)$){$\times 12$ blocks}; \end{tikzpicture}$

$\begin{tikzpicture} [ >={Stealth[length=1.6mm]}, every node/.style={font=\sffamily\scriptsize}, proc/.style={align=center, rounded corners=2pt, inner sep=3pt, minimum height=0.58cm, draw=blue!45!violet, fill=violet!8}, op/.style ={circle, draw=blue!45!violet, fill=violet!14, inner sep=0pt, minimum size=0.46cm}, term/.style={font=\sffamily\scriptsize\itshape, inner sep=1.5pt}, flow/.style={->, thick, gray!60, shorten >=1.5pt, shorten <=1.5pt}, skip/.style={->, thick, blue!45!violet, shorten >=1.5pt}, ] \node[term] (x){$x$}; \node[proc, right=0.5cm of x] (ln1){LN}; \node[proc, right=0.35cm of ln1](mh){MHSA}; \node[op, right=0.5cm of mh] (s1){$+$}; \node[proc, right=0.5cm of s1] (ln2){LN}; \node[proc, right=0.35cm of ln2](ml){MLP}; \node[op, right=0.5cm of ml] (s2){$+$}; \node[term, right=0.5cm of s2] (y){$y$}; \foreach \a/\b in {x/ln1,ln1/mh,mh/s1,s1/ln2,ln2/ml,ml/s2,s2/y}\draw[flow](\a)--(\b); \draw[skip] (x.north) .. controls +(0,0.72) and +(0,0.72) .. (s1.north) node[midway, above, term, blue!45!violet]{$+\,x$}; \draw[skip] (s1.north) .. controls +(0,0.72) and +(0,0.72) .. (s2.north) node[midway, above, term, blue!45!violet]{$+\,z$}; \node[term, below=0.18cm of ln2, gray!55!black]{one transformer block: $z = x + \mathrm{MHSA}(\mathrm{LN}\,x)$, \ $y = z + \mathrm{MLP}(\mathrm{LN}\,z)$}; \end{tikzpicture}$

-- 1
import LeanMlir

-- 2
def vitTiny : NetSpec where
  name   := "ViT-Tiny"
  imageH := 224
  imageW := 224
  layers := [
    .patchEmbed 3 192 16 196,              -- (224/16)^2 = 196 patches
    .transformerEncoder 192 3 768 12,      -- dim=192, 3 heads, mlp=768, 12 blocks
    .dense 192 10 .identity                -- classification head
  ]

-- 3
def vitTinyConfig : TrainConfig where
  learningRate   := 0.0003     -- 3× smaller than the ConvNets
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 5          -- longer warmup too
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  vitTiny.train vitTinyConfig (args.head?.getD "data/imagenette")

Three layers. The entire model body is patchEmbed, transformerEncoder, dense. .patchEmbed 3 192 16 196 chops the $224 \times 224 \times 3$ image into $14 \times 14 = 196$ non-overlapping $16 \times 16$ patches, linear-projects each to 192 dimensions, and prepends a learned $\mathrm{CLS}$ token. .transformerEncoder 192 3 768 12 runs 12 standard encoder blocks (multi-head self-attention + MLP + LayerNorm + residual, twice per block). .dense 192 10 .identity classifies off the CLS token. Every one of those sub-ops has its backward pass proved — the composition of all of them is § 84, the grand-finale theorem of Part 1.

Note the learning rate is 0.0003 (a third of what the ConvNets used) and warmup is 5 epochs (longer than the ConvNets’ 3). ViTs need gentler optimization schedules — the attention softmax is prone to collapse if gradients push logits too hard early, and warmup keeps the first few epochs slow enough to avoid that failure mode.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/vit-tiny-train
ViT-Tiny: 5526346 params
Generating train step MLIR...
  742145 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images (256×256)
  5526346 params + m + v (63 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.000300,
          cosine, label_smooth=0.1, wd=1e-4
  no BN (LayerNorm only)
  step 0/295: loss=2.302585 (366ms)
Epoch 10/80: loss=1.552973 lr=0.000298 (102247ms)
  val accuracy: 2121/3904 = 54.33%
Epoch 20/80: loss=1.384668 lr=0.000275 (100662ms)
  val accuracy: 2258/3904 = 57.84%
Epoch 40/80: loss=1.178420 lr=0.000172 (102352ms)
  val accuracy: 2642/3904 = 67.67%
Epoch 60/80: loss=1.042742 lr=0.000054 (102175ms)
  val accuracy: 2779/3904 = 71.18%
Epoch 80/80: loss=0.986243 lr=0.000000 (102234ms)
  val accuracy: 2799/3904 = 71.70%

(Full log in logs/vit_train.log.)

Final val accuracy 71.70% on Imagenette, wall time $\sim $2.3 hours — the fastest of any Imagenette-scale model in Part 1. Expanding the comparison table from the ConvNeXt chapter:

Model	Params	MLIR	Step time	Total	Val acc
ResNet-34	21.29M	518 KB	1400 ms	9.5 h	90.29%
MobileNet V2	2.24M	741 KB	830 ms	5.4 h	87.09%
EfficientNet-B0	7.16M	938 KB	940 ms	6.2 h	87.58%
ConvNeXt-T	27.83M	790 KB	2030 ms	13.3 h	84.94%
ViT-Tiny	5.53M	742 KB	360 ms	2.3 h	71.70%

ViT-Tiny is the fastest-per-step and the lowest-accuracy. That’s the data-hunger effect: 9469 training images is tiny for a transformer. The ViT paper reached ImageNet-competitive accuracy only when trained on ImageNet-21K (14M images) or JFT-300M (300M images); at Imagenette scale the ConvNets’ inductive bias (locality, translation equivariance) actively helps, and the ViT has to learn those properties from data it doesn’t have.

9.4 MLIR: Attention

What is already proven. Scaled dot-product attention is $\mathrm{sdpa}(Q,K,V) = \mathrm{softmax}(QK^{\! \top }/\sqrt{d})\, V$. Its reverse-mode derivative produces three gradients — $dQ$, $dK$, $dV$ — each proven separately (§§ 68–70, sdpa_back_Q/K/V), all built on the softmax backward, which couples every position to every other (§ 66). Because $Q$, $K$, and $V$ are three linear projections of the same input $a$, the input gradient is a three-way fan-in, $da = dQ\, W_q^{\! \top } + dK\, W_k^{\! \top } + dV\, W_v^{\! \top }$; sdpa_has_vjp_mat3 bundles the three, and transformerAttnSublayer_has_vjp_mat composes the whole sublayer (LayerNorm, the QKV projections, attention, the output projection, the residual) through vjp_comp_at.

The gap and how we close it. The softmax backward is non-elementwise and the three projections are coupled. The emitted graph denotes transformerAttnSublayer_has_vjp_mat’s backward. Here is the three-way fan-in at the input (two tokens, model dimension four, single head; the SDPA backward that produces $dQ,dK,dV$ elided to its comment):

// dQ,dK,dV = SDPA.back(dOut . Wo^T)  (softmax-weighted attention grads)
%daQ = stablehlo.dot_general %dQ, %Wq, contracting_dims = [1] x [1]
         : (tensor<2x4xf32>, tensor<4x4xf32>) -> tensor<2x4xf32>
%daK = stablehlo.dot_general %dK, %Wk, contracting_dims = [1] x [1]
         : (tensor<2x4xf32>, tensor<4x4xf32>) -> tensor<2x4xf32>
%daV = stablehlo.dot_general %dV, %Wv, contracting_dims = [1] x [1]
         : (tensor<2x4xf32>, tensor<4x4xf32>) -> tensor<2x4xf32>
%daQK = stablehlo.add %daQ, %daK : tensor<2x4xf32>
%da = stablehlo.add %daQK, %daV : tensor<2x4xf32>
// %dxa = LN.back(%da);  %dx = %dOut + %dxa  (residual skip)

Read it against the fan-in: each dot_general carries one of the attention gradients back through its projection ($dQ$ through $W_q$, and so on, contracting the model axis); the two adds sum the three. This is the residual fan-in of Chapter 5 with three branches instead of one — attention reads the input thrice, so the gradient returns thrice and adds. The softmax-weighted $dQ,dK,dV$ that feed the dot_generals come from the proven sdpa_back_Q/K/V, and the surrounding LayerNorm and output projection chain through their bridges.

One honesty note: the hard part of attention is in that first comment. This listing is the projection fan-in — three linear adjoints and two adds — while the softmax backward that couples every token to every other (softmax_has_vjp, § 9.2) is what // SDPA.back stands in for. We show the fan-in because it is the new structural move; the coupled core is carried by the proofs of § 9.2.

With attention pinned, the verified-codegen thread covers every operator in the modern vision stack — dense, convolution, BatchNorm, residual, depthwise, squeeze-excite, layer scale, and attention — each emitted as the rendering of a machine-checked derivative, from the same three pieces: denoted IR, bridge theorem, printer.

Caveats.

The attention bridge is unconditional — softmax and the linear projections are smooth everywhere, with no smooth-point clause.
Representative scale (two tokens, $d = 4$, one head).

9.5 Data Augmentation

The ViT-Tiny config we just trained uses a bare base recipe — random crop + hflip on the data side, no Mixup / CutMix / RandAugment / heavy weight decay. Holding architecture and base optimizer fixed, we can measure the marginal effect of each modern augmentation knob layered on top, on the same Imagenette data and 80-epoch budget.

Cell	Val acc	$\Delta $ vs bare
vit-tiny-cutmix-wd05	77.43%	$+5.7$
vit-tiny-cutmix	77.10%	$+5.4$
vit-tiny-mixup	74.69%	$+3.0$
vit-tiny-recipe2	74.41%	$+2.7$
vit-tiny-full	74.23%	$+2.5$
vit-tiny-randaug (M=9)	74.03%	$+2.3$
vit-tiny-randaug-m4 (M=4)	71.98%	$+0.3$
vit-tiny-bare	71.70%	—
vit-tiny-erase	70.62%	$-1.1$

(recipe2 = CutMix + Random Erasing + RandAugment-M9; full = Mixup + Random Erasing.)

CutMix is the load-bearing knob. +5.4% over bare. No close substitute; Mixup is a partial second at +3.0%.
Stacking on top of CutMix gives near-zero return. CutMix + heavy WD adds $+0.3\% $, well within seed noise. CutMix + RA + RE (recipe2) is negative ($-2.7\% $ vs CutMix alone): at 9.5K training images, piling on more aug erodes the per-image class signal faster than the diversity helps.
Random Erasing alone hurts ($-1.1\% $) at this scale. Removing 2–33% of pixels per image needs more redundant signal elsewhere than a 9.5K-image dataset has.
RandAugment magnitude matters. M=9 (paper default) works ($+2.3\% $). M=4 is essentially identity ($+0.3\% $); too gentle to bite.

The reading for a small-data practitioner: pick one strong knob (CutMix) and stop. The reading for a large-data practitioner: the same table at 1.2M images would lift different rows — the marginal effect of each knob is scale-dependent. The framing worth keeping is "marginal effect on top of what you already have," not the "stack everything" framing implicit in published paper recipes.

One seed per cell. The 0.3% delta between CutMix and CutMix+WD is below noise; the 5.4% delta from CutMix is well above.

9.6 ImageNet recipe

Why the phase-2 trainer? The ImageNet runs in this book use the phase-2 (Lean$\to $JAX) trainer: at this scale its job is to validate the framework’s logic end to end. Whether the phase-3 verified-IREE codegen can reach ImageNet under its own codegen rules is an open question — phase-2 is how we establish these baselines in the meantime.

The Imagenette ViT-Tiny above trained on $\sim $9.5K images. The full 1000-class run is the same architecture — patch16, embed 192, 12 transformer blocks, 3 heads, $\sim $5.7M parameters — with a 1000-class head, the .imagenet dataset, and the DeiT-flavored training recipe. Here’s what the loop looks like:

-- 1. Same ViT-Tiny backbone, 1000-class head.
def vitTinyImagenet : NetSpec where
  name   := "ViT-Tiny (ImageNet, bf16)"
  imageH := 224
  imageW := 224
  layers := [
    .patchEmbed 3 192 16 196,            -- (224/16)^2 = 196 patches
    .transformerEncoder 192 3 768 12,    -- 12 blocks, 3 heads, MLP 768
    .dense 192 1000 .identity            -- 1000-class head
  ]

-- 2. Full DeiT-Ti recipe: AdamW, cosine, the aug suite, stochastic depth, EMA.
def vitTinyImagenetConfig : TrainConfig where
  learningRate      := 5e-4          -- proper DeiT batch-512 LR
  batchSize         := 512
  epochs            := 300           -- full DeiT-Ti schedule
  useAdam           := true          -- AdamW (decoupled weight decay)
  weightDecay       := 0.05
  wdExcludeNormBias := true          -- no_weight_decay: skip norm/bias/pos-embed/CLS
  cosineDecay       := true
  warmupEpochs      := 5
  labelSmoothing    := 0.1
  gradClipNorm      := 1.0           -- the unlock for the 5e-4 LR (see below)
  useMixup          := true          -- Mixup alpha 0.8
  useCutmix         := true          -- CutMix alpha 1.0 (alternates with Mixup)
  useRandAugment    := true          -- rand-m9-mstd0.5-inc1 (color + geometric)
  randomErasing     := true          -- Random-Erase p 0.25
  dropPath          := 0.1           -- stochastic depth (linear 0 -> 0.1)
  useEMA            := true          -- model EMA decay 0.99996 (eval on shadow)
  bf16              := true          -- bf16 matmul compute

-- 3. Same .train entry point as every other chapter.
def main (args : List String) : IO Unit :=
  vitTinyImagenet.train vitTinyImagenetConfig
    (args.head?.getD "data/imagenet") .imagenet

As with the ResNet-34 ImageNet run (Ch 5), the phase-2 JAX path (jax/MainVitImagenet.lean, via lake build vit-tiny-imagenet) wires .imagenet through tfds streaming; the phase-3 IREE path awaits the same C-side streaming reader for the 1.28M-image set. Only the data-loading layer differs — the spec, the recipe knobs, and the .train invocation are identical.

The stability knob. ViT-Tiny is where the recipe stops being optional. At the proper DeiT learning rate (peak $5\times 10^{-4}$ for batch 512) the model collapses to chance the moment warmup ramps the LR past ${\sim }1.6\times 10^{-4}$ — train loss pins at $\ln (1000)\approx 6.9$ and never recovers. This is the well-known no-grad-clip transformer instability, sharpened here by the 1000-class softmax and bf16 gradient noise. The fix is one field: gradClipNorm := 1.0, global-norm gradient clipping, the DeiT default. With it, the very LR that collapsed without it trains cleanly through warmup and on to convergence.

bf16 and the run. ViT is matmul-bound — patch embed, attention QKV/scores/output, the MLP, the head are all dense matrix multiplies, with no convolution to speak of. So unlike the ResNet case, the relevant precision flag is bf16 alone (matmul compute in bfloat16, fp32 master weights, fp32 LayerNorm/softmax/GELU); there is no bf16Conv to set. bf16 matmul runs through the same NVIDIA tensor cores. Measured on the same four RTX 4060 Ti (CUDA) at batch 512, bf16 is $\mathbf{1.48\times }$ faster per step than fp32 ($176$ vs $260$ ms steady-state) — a smaller multiple than the convnet’s $1.6\times $, since ViT-Tiny’s matmuls are narrow (embed $192$) and spend proportionally more time in the fp32 LayerNorm/softmax that bf16 leaves untouched. On the ROCm box the effect is larger, not smaller: RDNA3’s matmul units (WMMA) give bf16 a $\mathbf{2.75\times }$ compute speedup on the full ViT-Tiny step ($428 \to 156$ ms for forward + backward + AdamW at batch $256$/GPU), diluting to ${\sim }2\times $ end-to-end once the fp32 LayerNorm/softmax and the tfds pipeline fold back in. The convnets see no such lift on gfx1100 — MIOpen’s bf16 convolution is no faster than fp32 there — but ViT is all matmul, so the win is real. The runs:

GPU	Precision	Per epoch	Epochs	Wall-clock	Val top-1	Val top-5
4$\times $ 4060 Ti (CUDA)	fp32	$\sim $11.1 min	80	$\sim $15 hr	—	—
4$\times $ 4060 Ti (CUDA)	bf16	$\sim $7.6 min	80	$\sim $11 hr	$\mathbf{65.64\% }$	$\mathbf{87.06\% }$
4$\times $ 4060 Ti (CUDA)	bf16	$\sim $7.6 min	300	$\sim $41 hr	—	—
2$\times $ 7900 XTX (ROCm)	bf16	$\sim $10 min	300	$\sim $50 hr	$\mathbf{70.28\% }$	$\mathbf{90.05\% }$

The full 300-epoch DeiT-Ti recipe reached $\mathbf{70.28\% }$ top-1 / $\mathbf{90.05\% }$ top-5 on the full 50,000-image validation split — up from the 80-epoch ancestor’s $65.64\% $ / $87.06\% $. That lands within ${\sim }2$ points of the ${\sim }72\% $ ViT-Tiny is capable of; the remaining gap is the one recipe knob still left on the table (repeated augmentation) plus two minor deviations (strict Mixup/CutMix alternation instead of a random switch, and Random-Erase to zero rather than to noise). The headline is a 300-epoch DeiT result with the full distillation-free recipe, and the point here is the pipeline (proven attention backward $\to $ DeiT recipe $\to $ stable bf16 training to a sensible ImageNet number), not a SOTA chase. The validation curve shows the familiar shape — fast climb through warmup, a long middle, a final lift as the cosine schedule anneals to zero:

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Validation accuracy (\%)}, xmin=0, xmax=301, ymin=0, ymax=95, xtick={0,50,100,150,200,250,300}, ytick={20,40,60,80}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=0.8pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (5,1.40) (10,6.82) (15,15.02) (20,24.81) (25,33.47) (30,40.08) (35,45.02) (40,48.33) (45,50.57) (50,52.47) (55,53.99) (60,55.23) (65,56.29) (70,57.00) (75,57.59) (80,58.27) (85,58.88) (90,59.32) (95,59.77) (100,60.26) (105,60.63) (110,61.07) (115,61.56) (120,61.95) (125,62.51) (130,62.86) (135,63.18) (140,63.59) (145,63.81) (150,64.08) (155,64.39) (160,64.77) (165,65.15) (170,65.43) (175,65.74) (180,66.00) (185,66.30) (190,66.60) (195,66.81) (200,67.03) (205,67.32) (210,67.55) (215,67.80) (220,68.02) (225,68.25) (230,68.51) (235,68.80) (240,68.97) (245,69.15) (250,69.36) (255,69.53) (260,69.62) (265,69.75) (270,69.86) (275,69.98) (280,70.11) (285,70.24) (290,70.26) (295,70.29) (300,70.28) }; \addlegendentry{top-1} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (5,5.08) (10,18.23) (15,33.13) (20,47.34) (25,58.43) (30,65.54) (35,70.40) (40,73.52) (45,75.76) (50,77.35) (55,78.42) (60,79.38) (65,80.16) (70,81.00) (75,81.57) (80,81.98) (85,82.37) (90,82.78) (95,83.21) (100,83.48) (105,83.89) (110,84.20) (115,84.62) (120,84.95) (125,85.22) (130,85.39) (135,85.71) (140,85.99) (145,86.20) (150,86.39) (155,86.57) (160,86.79) (165,87.01) (170,87.17) (175,87.33) (180,87.53) (185,87.69) (190,87.84) (195,87.96) (200,88.15) (205,88.34) (210,88.49) (215,88.71) (220,88.87) (225,88.97) (230,89.05) (235,89.19) (240,89.28) (245,89.44) (250,89.51) (255,89.63) (260,89.67) (265,89.75) (270,89.77) (275,89.85) (280,89.88) (285,89.96) (290,90.00) (295,89.98) (300,90.05) }; \addlegendentry{top-5} \end{axis} \end{tikzpicture}$

ViT-Tiny / ImageNet-1k validation accuracy per epoch (bf16, 2$\times $ RX 7900 XTX, ROCm; sampled every 5 epochs across the 300-epoch run).

Distance to the paper. DeiT-Ti reaches $72.2\% $ without distillation; our $70.28\% $ is ${\sim }2$ points short, and the accounting is mundane. The largest single item is repeated augmentation — DeiT samples each image in a batch three times under different augmentations, which disproportionately helps the data-hungry tiny models; our pipeline does not, and it is the obvious next lever. The rest is augmentation-fidelity drift — RandomResizedCrop and Random-Erasing are a from-scratch tf.data reimplementation rather than timm’s exact code (our RandAugment now matches timm’s rand-m9-mstd0.5-inc1 op-set literally) — plus a few small knobs (batch $512$ vs. $1024$, the gradient clipping we add for stability, Random-Erase to zero rather than to noise). None of it is a bug: the loss curve is textbook DeiT. The further $+2.3$ points to DeiT-Ti’s $74.5\% $ headline are a different thing entirely — the distillation variant (a second “distillation token” trained against a strong CNN teacher), which we do not implement.

[TODO — run w/ config.] vit-tiny-imagenet, 300 epochs (a re-run of the $70.28\% $ above). Changed since that number: RandAugment is now timm-literal (rand-m9-mstd0.5-inc1: SolarizeAdd$+$Invert op-set, per-op apply prob $0.5$, bilinear geometric interpolation, relative translate). Still open: repeated augmentation ($3\times $, the largest remaining lever) stays deferred. Re-running with the timm-faithful RandAugment — and, once built, repeated augmentation — is the path to close the $\sim $2 points to DeiT-Ti’s $72.2\% $.

What Part 1 has established

Every layer primitive shipped in Part 1’s trainers (MLP, CNN, ResNet, MobileNet, EfficientNet, ViT) has a machine-checked backward pass. The same NetSpec / TrainConfig / .train pipeline trains all of them with at most config-level changes. The MLIR each emits is between 20 KB (MLP) and 938 KB (EfficientNet-B0), and IREE compiles all of them the same way.

The bestiary in Part 2 then shows that every other architecture you’ve seen in modern deep learning — UNet, DETR, Mask R-CNN, DCGAN, CycleGAN, Pix2Pix, DDPM, Stable Diffusion, VAE, AlphaGo, AlphaZero, MuZero, Mamba, BERT, GPT, Whisper, CLIP, LLaVA, SAM, SegFormer, DeepLab v3+, Nyström-former, QANet, Evoformer — composes from the same small set of primitives plus a handful of architecture-specific bundled layers. Three rules (chain, additive fan-in, multiplicative fan-in). Five Jacobian tricks (diagonal, sparse Toeplitz, binary selection, rank-1 correction, outer product).

That’s the whole framework. Part 2’s bestiary takes the same set of primitives and tours how they compose into every well-known modern architecture — a survey of the field expressed in the proven kit, with a small library of bundled idioms for the patterns that recur.

Cell	Val acc	\(\Delta \) vs bare
vit-tiny-cutmix-wd05	77.43%	\(+5.7\)
vit-tiny-cutmix	77.10%	\(+5.4\)
vit-tiny-mixup	74.69%	\(+3.0\)
vit-tiny-recipe2	74.41%	\(+2.7\)
vit-tiny-full	74.23%	\(+2.5\)
vit-tiny-randaug (M=9)	74.03%	\(+2.3\)
vit-tiny-randaug-m4 (M=4)	71.98%	\(+0.3\)
vit-tiny-bare	71.70%	—
vit-tiny-erase	70.62%	\(-1.1\)