Verified Deep Learning with Lean 4

10 Vision Transformer

Where Part 1 has been heading

Every previous chapter has introduced a new architectural primitive and proved its backward pass. Convolution, BatchNorm, residual, depthwise, SE, LayerNorm, GELU—each one slotted into the NetSpec type and earned its own has_vjp theorem. This chapter does the same for attention, but it also does something larger: it composes every layer the framework has proved so far into one machine-checked backward pass over a complete state-of-the-art image classifier.

The destination is Theorem 81 at the end of the chapter. It says, roughly: the full ViT backbone—patch embedding, twelve transformer blocks, final LayerNorm—is a \(\mathsf{HasVJPMat}\). The proof composes everything this chapter introduces, plus most of what the previous chapters introduced, into a single tower of vjpMat_comp applications. Read the proof’s dependency list and you see a tour of Part 1: dense, residual, layer norm, GELU, softmax, matrix multiply, plus the eight or ten matrix-machinery lemmas this chapter sets up. None of them is new in its own right; what’s new is how high the composition rule can be pushed.

What attention is, in one screen

A transformer encoder block does two things to its input sequence \(X \in \mathbb {R}^{N \times d}\) (think of \(N\) image patches, each represented as a \(d\)-vector):

  1. Multi-head self-attention (MHSA): each output row is a weighted average of all input rows, where the weights are computed from the input itself.

  2. Per-token MLP: a small two-layer MLP applied independently to every row.

Both halves are wrapped in a LayerNorm + residual sandwich (exactly the same template as ConvNeXt’s blocks from Chapter 9; the layout transferred to convnets in 2022 actually came from transformers in 2017).

The interesting half is the first one. Scaled dot-product attention computes, for queries \(Q\), keys \(K\), and values \(V\) (all linear projections of \(X\)):

\[ \mathrm{Attention}(Q, K, V) \; =\; \mathrm{softmax}\! \left(\frac{Q K^\top }{\sqrt{d_h}}\right) V. \]

In words: every row of \(Q\) is compared to every row of \(K\) via inner product (the \(QK^\top \) matrix is \(N \times N\), one entry per query-key pair). Each row of that \(N \times N\) matrix is passed through softmax to get a probability distribution over the keys (which positions to attend to). Those probabilities then weight the rows of \(V\) to produce the output. The \(1 / \sqrt{d_h}\) scaling keeps the softmax from saturating at large \(d_h\). Multi-head attention just runs several copies of this in parallel on lower-dimensional projections, concatenating the outputs.

Each token—meaning each row of \(X\)—ends up looking at every other token. That’s the qualitative break from CNNs: convolutions look only at a local neighborhood, attention looks everywhere. The receptive field is global from the first layer.

Why this chapter has thirty theorems

The math of attention is matrix algebra. Chapter 2’s foundation rules—chain, sum, product, identity—were stated and proved for vector functions. Before we can prove backward passes through \(Q K^\top \) and the row-wise softmax, we need the matrix-level lifts of those rules: chain rule on matrices, additive fan-in on matrices, identity on matrices, plus matrix-specific moves like transpose, matmul-with-one-factor-fixed, and scalar scale. That’s the first section of theorems below (Matrix-level machinery, \(\sim \)15 theorems). Every one of them is a mechanical lift of a vector-level theorem from Chapter 2, proved via the row-projection \(\mathrm{ContinuousLinearMap}\) and the existing chain rule. Pay the cost once, reuse forever.

The second section (Attention proofs, \(\sim \)15 theorems) applies the matrix machinery to the actual attention math. Row-wise softmax—a per-row softmax applied to the \(N \times N\) score matrix—has the well-known closed-form Jacobian \(p_i(\delta _{ij} - p_i)\); we prove the row-wise lifting that says this holds across matrices. Scaled dot-product attention’s backward decomposes into three matmul-backward applications (one for each of \(Q\), \(K\), \(V\)) plus the row-wise softmax backward; we prove each. Then multi-head, then the attention sublayer (with residual), then the MLP sublayer (with residual), then a single transformer block, then a \(k\)-deep transformer tower, then the full ViT body. Each step is one or two compositions of the previous step. The chain that started at vjp_comp in Chapter 2 reaches all the way to a 5.5M-parameter image classifier here.

What’s actually proved

That’s the framework-side payoff of this chapter: every architectural piece used by any Part-1 network has a machine-checked backward. Once we land Theorem 81, the sentence “the gradient computed by this trainer is the mathematically correct one” is no longer a hope or a folklore result—it is a theorem the Lean kernel verifies on every build.

The empirical-side payoff comes after the theorems, in the Example section: ViT-Tiny trained on Imagenette underperforms the ConvNets, fairly dramatically, because 9,469 training images is too small for a transformer’s lack of inductive bias. That’s informative on its own terms (the data-hungry-transformer story made concrete) and sets up the augmentation ablation that closes the chapter.

10.1 Matrix-level machinery

Attention is fundamentally matrix algebra: queries times keys transposed (\(QK^T\)), softmax over rows of the resulting matrix, matmul against values (\(\cdot V\)). Before we can prove backward passes through these operations, we need the matrix-level extensions of Ch 2’s foundation rules. Each theorem below is the matrix lift of a vector-level theorem from Ch 2, proved by composing the vector version with the row-projection \(\mathrm{ContinuousLinearMap}\). The Lean proofs all live in Proofs/Tensor.lean alongside their vector counterparts; we collect them here because Ch 10 is the first and only chapter to use them.

Theorem 46 Row-independence for matrices
#

Applying a vector function per-row to a matrix has block-diagonal Jacobian. Proved from \(\operatorname {fderiv}\) via the row-projection \(\mathrm{ContinuousLinearMap}\) and the chain rule, given a \(\mathsf{Differentiable}\) hypothesis on the per-row function.

Proof

Mechanical; see Proofs.pdivMat_rowIndep.

Theorem 47 Matrix-level chain rule
#

Lifts vjp_comp via the \(\mathsf{Mat.flatten}\) bijection.

Proof

Mechanical; see Proofs.vjpMat_comp.

Theorem 48 Matrix-level additive fan-in
#
Proof

Mechanical; see Proofs.biPathMat_has_vjp.

Theorem 49 Matrix-level identity
#
Proof

Mechanical; see Proofs.identityMat_has_vjp.

Theorem 50 Scalar-scale Jacobian
#

Phase 6 derivation.

Proof

Mechanical; see Proofs.pdivMat_scalarScale.

Theorem 51 Transpose Jacobian
#
Proof

Mechanical; see Proofs.pdivMat_transpose.

Theorem 52 Matmul Jacobian, left factor fixed

Phase 6: derived, not axiomatized.

Proof
Theorem 53 Matmul Jacobian, right factor fixed
Proof
Theorem 54 Matmul VJP, left factor fixed
#
Proof
Theorem 55 Matmul VJP, right factor fixed
#
Proof
Theorem 56 Scalar-scale VJP
#
Proof

Mechanical; see Proofs.scalarScale_has_vjp.

Theorem 57 Transpose VJP
#
Proof

Mechanical; see Proofs.transpose_has_vjp.

Theorem 58 Row-wise lifting of any \(\mathsf{HasVJP}\)
#

Phase 8 generic helper: lifts any \(\mathsf{HasVJP}\, (g : \mathbb {R}^{n} \to \mathbb {R}^{p})\) to a \(\mathsf{HasVJPMat}\) on \(\mathbb {R}^{m \times n} \to \mathbb {R}^{m \times p}\).

Proof

Mechanical; see Proofs.rowwise_has_vjp_mat.

Theorem 59 3D chain rule
#
Proof

Mechanical; see Proofs.pdiv3_comp.

Theorem 60 3D VJP chain rule
#
Proof

Mechanical; see Proofs.vjp3_comp.

Theorem 61 3D additive fan-in
#
Proof

Mechanical; see Proofs.biPath3_has_vjp.

10.2 Attention proofs

Theorem 62 Softmax Jacobian
#

\(\partial \mathrm{softmax}(z)_j / \partial z_i = p_j (\delta _{ij} - p_i)\). Proved by extracting the j-th coordinate function \(z' \mapsto e^{z'_j} \cdot (\sum _k e^{z'_k})^{-1}\) and chaining HasFDerivAt.exp with (hasDerivAt_inv).comp_hasFDerivAt. At \(\mathbf{e}_i\) the sum collapses via \(\sum _k e^{z_k} \cdot \delta _{ki} = e^{z_i}\).

Proof

Mechanical; see Proofs.pdiv_softmax.

Theorem 63 Standalone softmax VJP
#

Closed-form collapse: \(dz_i = p_i (dy_i - \langle p, dy \rangle )\).

Proof

Mechanical; see Proofs.softmax_has_vjp.

Theorem 64 Row-wise softmax VJP on a matrix
#
Proof
Theorem 65 SDPA backward wrt Q

Composed from four proved \(\mathsf{HasVJPMat}\) building blocks.

Proof

Mechanical; see Proofs.sdpa_back_Q_correct.

Theorem 66 SDPA backward wrt K
Proof

Mechanical; see Proofs.sdpa_back_K_correct.

Theorem 67 SDPA backward wrt V
#
Proof

Mechanical; see Proofs.sdpa_back_V_correct.

Theorem 68 Multi-head SDPA VJP
#

Phase 3 (Apr 2026): proved via column-stacking, lifting the proved single-head SDPA backward (sdpa_back_*) over the head axis using the new pdivMat_colIndep + colSlabwise_has_vjp_mat framework.

Proof

Mechanical; see Proofs.mhsa_has_vjp_mat.

Theorem 69 MHSA layer smoothness
#

\(\mathsf{Differentiable}\) sibling of mhsa_has_vjp_mat, proved alongside the VJP via the same column-stacking framework.

Proof

Mechanical; see Proofs.mhsa_layer_flat_diff.

Theorem 70 Patch-embedding VJP
#

Phase 6b (Apr 2026): de-opaqued in 6a (concrete def unfolding to conv-with-stride + CLS prepend + positional embed), then proved as input VJP via spatial-rearrangement bridge to the proved conv2d_has_vjp3.

Proof
Theorem 71 Patch-embedding smoothness
#

\(\mathsf{Differentiable}\) sibling of patchEmbed_flat_has_vjp, proved via the same rearrangement bridge.

Proof

Mechanical; see Proofs.patchEmbed_flat_diff.

Theorem 72 Per-token LayerNorm lifted to a matrix
Proof
Theorem 73 Per-token dense lifted to a matrix
Proof
Theorem 74 Per-token GELU lifted to a matrix
Proof
Theorem 75 Row-wise softmax smoothness
#

\(\mathrm{rowSoftmax}\) (the flattened form) is \(\mathsf{Differentiable}\). Proved via Mathlib’s \(\mathrm{Real.exp}\) calculus: the denominator \(\sum _j \exp (M_{rj})\) is positive (by Finset.sum_pos), giving a Differentiable.inv application that finishes the chain.

Proof

Mechanical; see Proofs.rowSoftmax_flat_diff.

Theorem 76 Transformer MLP sublayer VJP

A vjpMat_comp chain of three per-token liftings. \(\mathsf{Differentiable}\) hypotheses are now threaded through the chain; proved.

Proof
Theorem 77 Transformer attention sublayer with residual VJP

\(\operatorname {biPathMat}\) of identity and \(\mathrm{mhsa} \circ \mathrm{LN_1}\). Diff hypotheses threaded; proved.

Proof
Theorem 78 Transformer MLP sublayer with residual VJP

Diff hypotheses threaded; proved.

Proof
Theorem 79 Transformer block VJP

One vjpMat_comp glueing the two sublayers. Proved.

Proof
Theorem 80 Transformer tower, any depth
#

\(k\)-fold induction via vjpMat_comp; covers ViT-Tiny / Base / Large.

Proof
Theorem 81 ViT body: the grand finale
#

The full ViT transformer backbone as one \(\mathsf{HasVJPMat}\): \(\mathrm{finalLN} \circ \mathrm{transformerTower}\) glued by vjpMat_comp.

Proof

Mechanical; see Proofs.vit_body_has_vjp_mat.

10.3 Example: ViT-Tiny on Imagenette

This is the capstone. Every layer proved in the preceding chapters — dense, ReLU, softmax cross-entropy, convolution, maxPool, batch norm, residual, depthwise conv, squeeze-and-excitation, layer norm, GELU, and the attention mechanism itself — composes into this one architecture. If the NetSpec below compiles and the VJP of its body is proved (§ 81), then every layer in modern deep learning you’ve seen in this book has its backward pass machine-checked.

ViT-Tiny (Dosovitskiy et al. 2020) was the smallest Vision Transformer variant in the original paper. It’s also the least data-hungry — the big ViT variants start beating ConvNets at ImageNet-21K scale and above. On a 9469-image Imagenette training set, ViT-Tiny underperforms the ResNet-34 / MobileNet V2 / EfficientNet-B0 from earlier chapters. That’s part of the teaching point: transformers are not a universal improvement, they are a different scaling regime.

The architecture

-- 1
import LeanMlir

-- 2
def vitTiny : NetSpec where
  name   := "ViT-Tiny"
  imageH := 224
  imageW := 224
  layers := [
    .patchEmbed 3 192 16 196,              -- (224/16)^2 = 196 patches
    .transformerEncoder 192 3 768 12,      -- dim=192, 3 heads, mlp=768, 12 blocks
    .dense 192 10 .identity                -- classification head
  ]

-- 3
def vitTinyConfig : TrainConfig where
  learningRate   := 0.0003     -- 3× smaller than the ConvNets
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 5          -- longer warmup too
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  vitTiny.train vitTinyConfig (args.head?.getD "data/imagenette")

Three layers. Literally three. The entire model body is patchEmbed, transformerEncoder, dense. .patchEmbed 3 192 16 196 chops the \(224 \times 224 \times 3\) image into \(14 \times 14 = 196\) non-overlapping \(16 \times 16\) patches, linear-projects each to 192 dimensions, and prepends a learned \(\mathrm{CLS}\) token. .transformerEncoder 192 3 768 12 runs 12 standard encoder blocks (multi-head self-attention + MLP + LayerNorm + residual, twice per block). .dense 192 10 .identity classifies off the CLS token. Every one of those sub-ops has its backward pass proved — the composition of all of them is § 81, the grand-finale theorem of Part 1.

Note the learning rate is 0.0003 (a third of what the ConvNets used) and warmup is 5 epochs (longer than the ConvNets’ 3). ViTs famously need gentler optimization schedules — the attention softmax is prone to collapse if gradients push logits too hard early, and warmup keeps the first few epochs slow enough to avoid that failure mode.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/vit-tiny-train
ViT-Tiny: 5526346 params
Generating train step MLIR...
  742145 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images (256×256)
  5526346 params + m + v (63 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.000300,
          cosine, label_smooth=0.1, wd=1e-4
  no BN (LayerNorm only)
  step 0/295: loss=2.302585 (366ms)
Epoch 10/80: loss=1.552973 lr=0.000298 (102247ms)
  val accuracy: 2121/3904 = 54.33%
Epoch 20/80: loss=1.384668 lr=0.000275 (100662ms)
  val accuracy: 2258/3904 = 57.84%
Epoch 40/80: loss=1.178420 lr=0.000172 (102352ms)
  val accuracy: 2642/3904 = 67.67%
Epoch 60/80: loss=1.042742 lr=0.000054 (102175ms)
  val accuracy: 2779/3904 = 71.18%
Epoch 80/80: loss=0.986243 lr=0.000000 (102234ms)
  val accuracy: 2799/3904 = 71.70%

(Full log in logs/vit_train.log.)

Final val accuracy 71.70% on Imagenette, wall time \(\sim \)2.3 hours — the fastest of any Imagenette-scale model in Part 1. Expanding the comparison table from the ConvNeXt chapter:

Model

Params

MLIR

Step time

Total

Val acc

ResNet-34

21.29M

518 KB

1400 ms

9.5 h

90.29%

MobileNet V2

2.24M

741 KB

830 ms

5.4 h

87.09%

EfficientNet-B0

7.16M

938 KB

940 ms

6.2 h

87.58%

ConvNeXt-T

27.83M

790 KB

2030 ms

13.3 h

84.94%

ViT-Tiny

5.53M

742 KB

360 ms

2.3 h

71.70%

ViT-Tiny is the fastest-per-step and the lowest-accuracy. That’s the data-hunger effect: 9469 training images is tiny for a transformer. The ViT paper reached ImageNet-competitive accuracy only when trained on ImageNet-21K (14M images) or JFT-300M (300M images); at Imagenette scale the ConvNets’ inductive bias (locality, translation equivariance) actively helps, and the ViT has to learn those properties from data it doesn’t have.

10.4 Data Augmentation

The ViT-Tiny config we just trained uses a bare base recipe — random crop + hflip on the data side, no Mixup / CutMix / RandAugment / heavy weight decay. Holding architecture and base optimizer fixed, we can measure the marginal effect of each modern augmentation knob layered on top, on the same Imagenette data and 80-epoch budget.

Cell

Val acc

\(\Delta \) vs bare

vit-tiny-cutmix-wd05

77.43%

\(+5.7\)

vit-tiny-cutmix

77.10%

\(+5.4\)

vit-tiny-mixup

74.69%

\(+3.0\)

vit-tiny-recipe2

74.41%

\(+2.7\)

vit-tiny-full

74.23%

\(+2.5\)

vit-tiny-randaug (M=9)

74.03%

\(+2.3\)

vit-tiny-randaug-m4 (M=4)

71.98%

\(+0.3\)

vit-tiny-bare

71.70%

vit-tiny-erase

70.62%

\(-1.1\)

(recipe2 = CutMix + Random Erasing + RandAugment-M9; full = Mixup + Random Erasing.)

  • CutMix is the load-bearing knob. +5.4% over bare. No close substitute; Mixup is a partial second at +3.0%.

  • Stacking on top of CutMix gives near-zero return. CutMix + heavy WD adds \(+0.3\% \), well within seed noise. CutMix + RA + RE (recipe2) is negative (\(-2.7\% \) vs CutMix alone): at 9.5K training images, piling on more aug erodes the per-image class signal faster than the diversity helps.

  • Random Erasing alone hurts (\(-1.1\% \)) at this scale. Removing 2–33% of pixels per image needs more redundant signal elsewhere than a 9.5K-image dataset has.

  • RandAugment magnitude matters. M=9 (paper default) works (\(+2.3\% \)). M=4 is essentially identity (\(+0.3\% \)); too gentle to bite.

The reading for a small-data practitioner: pick one strong knob (CutMix) and stop. The reading for a large-data practitioner: the same table at 1.2M images would lift different rows — the marginal effect of each knob is scale-dependent. The framing worth keeping is "marginal effect on top of what you already have," not the "stack everything" framing implicit in published paper recipes.

One seed per cell. The 0.3% delta between CutMix and CutMix+WD is below noise; the 5.4% delta from CutMix is well above.

10.5 ImageNet recipe

The Imagenette ViT-Tiny above trained on \(\sim \)9.5K images. The full 1000-class run is the same architecture — patch16, embed 192, 12 transformer blocks, 3 heads, \(\sim \)5.7M parameters — with a 1000-class head, the .imagenet dataset, and the DeiT-flavored training recipe. Here’s what the loop looks like:

-- 1. Same ViT-Tiny backbone, 1000-class head.
def vitTinyImagenet : NetSpec where
  name   := "ViT-Tiny (ImageNet, bf16)"
  imageH := 224
  imageW := 224
  layers := [
    .patchEmbed 3 192 16 196,            -- (224/16)^2 = 196 patches
    .transformerEncoder 192 3 768 12,    -- 12 blocks, 3 heads, MLP 768
    .dense 192 1000 .identity            -- 1000-class head
  ]

-- 2. DeiT recipe: AdamW, cosine, the augmentation suite, grad clipping.
def vitTinyImagenetConfig : TrainConfig where
  learningRate   := 5e-4            -- proper DeiT batch-512 LR
  batchSize      := 512
  epochs         := 80
  useAdam        := true            -- AdamW (decoupled weight decay)
  weightDecay    := 0.05
  cosineDecay    := true
  warmupEpochs   := 5
  labelSmoothing := 0.1
  gradClipNorm   := 1.0             -- the unlock for the 5e-4 LR (see below)
  useMixup       := true            -- DeiT aug: mixup + cutmix alternate
  useCutmix      := true
  bf16           := true            -- bf16 matmul compute

-- 3. Same .train entry point as every other chapter.
def main (args : List String) : IO Unit :=
  vitTinyImagenet.train vitTinyImagenetConfig
    (args.head?.getD "data/imagenet") .imagenet

As with the ResNet-34 ImageNet run (Ch 6), the phase-2 JAX path (jax/MainVitImagenet.lean, via lake build vit-tiny-imagenet) wires .imagenet through tfds streaming; the phase-3 IREE path awaits the same C-side streaming reader for the 1.28M-image set. Only the data-loading layer differs — the spec, the recipe knobs, and the .train invocation are identical.

The stability knob. ViT-Tiny is where the recipe stops being optional. At the proper DeiT learning rate (peak \(5\times 10^{-4}\) for batch 512) the model collapses to chance the moment warmup ramps the LR past \({\sim }1.6\times 10^{-4}\) — train loss pins at \(\ln (1000)\approx 6.9\) and never recovers. This is the well-known no-grad-clip transformer instability, sharpened here by the 1000-class softmax and bf16 gradient noise. The fix is one field: gradClipNorm := 1.0, global-norm gradient clipping, the DeiT default. With it, the very LR that collapsed without it trains cleanly through warmup and on to convergence.

bf16 and the run. ViT is matmul-bound — patch embed, attention QKV/scores/output, the MLP, the head are all dense matrix multiplies, with no convolution to speak of. So unlike the ResNet case, the relevant precision flag is bf16 alone (matmul compute in bfloat16, fp32 master weights, fp32 LayerNorm/softmax/GELU); there is no bf16Conv to set. bf16 matmul runs through the same NVIDIA tensor cores. Measured on the same four RTX 4060 Ti (CUDA) at batch 512, bf16 is \(\mathbf{1.48\times }\) faster per step than fp32 (\(176\) vs \(260\) ms steady-state) — a smaller multiple than the convnet’s \(1.6\times \), since ViT-Tiny’s matmuls are narrow (embed \(192\)) and spend proportionally more time in the fp32 LayerNorm/softmax that bf16 leaves untouched. The run:

GPU

Precision

Per epoch

80 epochs

Val top-1

Val top-5

4\(\times \) 4060 Ti (CUDA)

fp32

\(\sim \)11.1 min

\(\sim \)15 hr

4\(\times \) 4060 Ti (CUDA)

bf16

\(\sim \)7.6 min

\(\sim \)11 hr

\(\mathbf{65.64\% }\)

\(\mathbf{87.06\% }\)

(The fp32 row is throughput-only — steady-state ms/step measured over a few hundred steps, projected to 80 epochs; only the bf16 configuration was trained to completion. bf16 does not cost final accuracy here.)

The 80-epoch run reached \(\mathbf{65.64\% }\) top-1 / \(\mathbf{87.06\% }\) top-5 on the full 50,000-image validation split. That is below the \({\sim }72\% \) ViT-Tiny is capable of — but that headline is a 300-epoch DeiT result with the full distillation-free recipe; at 80 epochs, \({\sim }66\% \) is the expected band, and the point here is the pipeline (proven attention backward \(\to \) DeiT recipe \(\to \) stable bf16 training to a sensible ImageNet number), not a SOTA chase. The validation curve shows the familiar shape — fast climb through warmup, a long middle, a final lift as the cosine schedule anneals to zero:

\begin{tikzpicture} 
\begin{axis}[
    width=0.92\linewidth, height=6.5cm,
    xlabel={Epoch}, ylabel={Validation accuracy (\%)},
    xmin=0, xmax=81, ymin=0, ymax=95,
    xtick={0,10,20,30,40,50,60,70,80},
    ytick={20,40,60,80},
    legend pos=south east,
    legend cell align={left},
    grid=major, grid style={gray!18},
    tick label style={font=\small},
    label style={font=\small},
    every axis plot/.append style={line width=1pt, mark size=1pt},
]
\addplot[blue, mark=*, mark options={fill=blue}] coordinates {
(1,1.79) (2,4.78) (3,7.20) (4,10.64) (5,15.34) (6,20.05) (7,24.64) (8,28.83) (13,40.52) (14,41.53) (15,43.21) (16,44.33) (20,47.87) (21,49.00) (22,49.64) (23,50.11) (24,50.27) (30,53.26) (34,55.10) (35,55.70) (36,56.27) (37,56.34) (38,56.64) (46,59.11) (52,60.78) (53,61.25) (61,63.37) (69,64.76) (73,65.40) (74,65.47) (75,65.53) (78,65.66) (79,65.67) (80,65.68)
};
\addlegendentry{top-1}
\addplot[orange, mark=*, mark options={fill=orange}] coordinates {
(1,6.43) (2,13.98) (3,19.60) (4,26.20) (5,33.91) (6,41.79) (7,47.63) (8,53.16) (13,66.33) (14,67.47) (15,68.91) (16,69.56) (20,73.48) (21,74.21) (22,74.73) (23,75.47) (24,75.40) (30,77.91) (34,79.37) (35,80.08) (36,80.54) (37,80.66) (38,80.86) (46,82.49) (52,83.88) (53,84.03) (61,85.50) (69,86.54) (73,86.97) (74,86.95) (75,87.04) (78,87.07) (79,87.08) (80,87.08)
};
\addlegendentry{top-5}
\end{axis}
\end{tikzpicture}

ViT-Tiny / ImageNet-1k validation accuracy per epoch (bf16, 4\(\times \) 4060 Ti; sampled across the run, which was resumed from per-epoch checkpoints five times when the host’s PCIe links threw transient errors under load — the checkpoint/resume harness absorbed each without a lost epoch).

What Part 1 has established

Every layer primitive shipped in Part 1’s trainers (MLP, CNN, ResNet, MobileNet, EfficientNet, ViT) has a machine-checked backward pass. The same NetSpec / TrainConfig / .train pipeline trains all of them with at most config-level changes. The MLIR each emits is between 20 KB (MLP) and 938 KB (EfficientNet-B0), and IREE compiles all of them the same way.

The bestiary in Part 2 then shows that every other architecture you’ve seen in modern deep learning — UNet, DETR, Mask R-CNN, DCGAN, CycleGAN, Pix2Pix, DDPM, Stable Diffusion, VAE, AlphaGo, AlphaZero, MuZero, Mamba, BERT, GPT, Whisper, CLIP, LLaVA, SAM, SegFormer, DeepLab v3+, Nyström-former, QANet, Evoformer — composes from the same small set of primitives plus a handful of architecture-specific bundled layers. Three rules (chain, additive fan-in, multiplicative fan-in). Five Jacobian tricks (diagonal, sparse Toeplitz, binary selection, rank-1 correction, outer product).

That’s the whole framework. Part 2’s bestiary takes the same set of primitives and tours how they compose into every well-known modern architecture — a survey of the field expressed in the proven kit, with a small library of bundled idioms for the patterns that recur.