Verified Deep Learning with Lean 4

6 ResNet-34

Deeper networks used to stop training

Before 2015, the dominant assumption in image classification was “deeper is better, up to a point.” The “up to a point” part was not metaphorical. If you took a 20-layer CNN that trained to 70% val accuracy and added 12 more convolutional layers, you didn’t get a 32-layer model with 72% accuracy. You got a 32-layer model that, after the same training budget, fit the training set worse than the 20-layer one did. The 32-layer model had strictly more capacity. It still trained worse. Something about deep stacks of convolutions made gradient flow degrade in a way that the optimizer couldn’t navigate around.

He et al. 2015 (arXiv:1512.03385) had the right observation: if the 12 extra layers in the deeper network were initialized to compute the identity, the 32-layer network would have at least the 20-layer’s performance, because it could replicate it exactly. So the question was not “can a deep network represent what a shallow one represents?”—it provably could. The question was: why is optimizing 12 free-form layers to recover identity so much harder than just leaving identity there?

The residual block: ask for the difference, not the function

Their fix was a one-line architectural change. Instead of asking each block to learn a function \(y = f(x)\), ask it to learn a function \(y = f(x) + x\). The block’s job is no longer to become the new representation. Its job is to learn what to add to the current representation. At initialization, with randomly-small \(f\), each block is approximately the identity, and a stack of 16 such blocks is approximately a stack of 16 identities—which is to say, approximately the input. The network starts out near-identity at every depth and only gradually takes on structure as \(f\) trains. The optimization problem becomes “what residual should each block add” instead of “what function should each block be,” and the former turns out to be dramatically easier.

The change to the architecture is one + sign. The change to the training dynamics is the difference between “18 layers is about as deep as we can go” and “152 layers trains stably; let us also add another zero on the ImageNet leaderboard.”

Why the proof is a one-liner

From the framework’s perspective, what is a residual block? \(\mathrm{residual}(x) = f(x) + \mathrm{id}(x)\)—a sum of two functions sharing the same input. We already proved in Chapter 2 that the VJP of an additive fan-in is “send the upstream gradient through both branches and add.” We already proved that the VJP of \(\mathrm{id}\) is the identity on the gradient (and that the VJP of \(f\) is whatever the inner block’s machinery gives us). Composition is just \(\mathrm{vjp\_ comp}\). So the residual VJP is the additive fan-in lemma, with \(\mathrm{id}\) plugged into one slot.

That is Theorem 35 below, and its proof is one line. This chapter has exactly one theorem, and that theorem is mechanical. The engineering revolution was not in the math—it was in the realization that this trivially provable operation was the missing piece for training deep networks. The fact that the framework absorbs the new architecture without any new math is the value proposition: once the foundation rules are in place, the next decade of image recognition follows from composing them.

ResNet-34 is sixteen of these blocks, stacked

ResNet-34 (the smallest “deep” variant in the original paper) is built from 16 residual blocks arranged in four stages of depths 3, 4, 6, 3, at channel widths 64, 128, 256, 512. Each stage starts with a stride-2 downsample (halving spatial resolution) before its stack of stride-1 blocks. The stages are glued together by a stem (a 7\(\times \)7 conv at stride 2 plus a max-pool, taking 224\(\times \)224\(\times \)3 input to 56\(\times \)56\(\times \)64 features) and a head (global average pool plus a 512-to-10 dense layer). Total: 34 weight-bearing layers, 21.3M parameters.

Two pieces of this template are new and worth naming:

  1. The 7\(\times \)7 stride-2 stem. Earlier chapters’ networks operated at full input resolution; ResNet aggressively downsamples at the input so the bulk of compute happens at lower spatial resolution. The stem is one big conv that gets us from 224\(\times \)224 to 56\(\times \)56 in a single step.

  2. Global average pool instead of flatten-plus-dense. Chapters 4 and 5 ended in a .flatten followed by a \(\sim \)4096-to-512 dense layer that contained tens of millions of parameters. ResNet collapses the final 7\(\times \)7\(\times \)512 feature map to a 512-vector by spatial averaging, then runs a single 512-to-10 dense. No flatten layer. The architectural move alone drops the model from what would have been a \(\sim \)200M-parameter AlexNet-era spec down to 21.3M.

Every chapter after this one is a variation on this template. ConvNeXt swaps in different blocks. MobileNetV2 swaps in depthwise-separable blocks. EfficientNet adds compound scaling. ViT replaces the whole convolutional spine with patches and attention. The stem-stages-head template, and the residual fan-in pattern, persist.

The theorem

Theorem 35 Residual block VJP
#

Residual = \(f + \mathrm{id}\).

Proof

Mechanical; see Proofs.residual_has_vjp.

6.1 Example: ResNet-34 on Imagenette

Same template as the MNIST and CIFAR chapters, but at the scale the architecture was designed for. ResNet-34 on 224\(\times \)224 Imagenette — 10 classes, real photographs, the transition point from “classroom dataset” to “actual image recognition.”

The residual block itself is the whole trick: instead of \(y = f(x)\), compute \(y = f(x) + x\). That’s the additive-fan-in pattern of § 10 with \(\mathrm{id}\) on one branch, which is the composition § 35 proves. Stacking 34 layers worth of these blocks just composes the fan-in rule 34 times (well, 16 residual blocks deep and 34 convs across all of them).

The architecture

-- 1
import LeanMlir

-- 2
def resnet34 : NetSpec where
  name   := "ResNet-34"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 64 7 2 .same,         -- stem
    .maxPool 2 2,
    .residualBlock  64  64 3 1,     -- stage 1 — 3 blocks
    .residualBlock  64 128 4 2,     -- stage 2 — 4 blocks
    .residualBlock 128 256 6 2,     -- stage 3 — 6 blocks
    .residualBlock 256 512 3 2,     -- stage 4 — 3 blocks
    .globalAvgPool,                 -- replaces flatten + fat dense
    .dense 512 10 .identity
  ]

-- 3
def resnet34Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  resnet34.train resnet34Config (args.head?.getD "data/imagenette")

First chapter that uses the full production recipe (Adam + cosine + warmup + weight decay + augmentation + label smoothing) rather than the s4tfBaseline. Also the first chapter where .globalAvgPool replaces the flatten-plus-fat-dense head. Global average pooling (GAP) collapses each \(C \times H \times W\) feature map to a length-\(C\) vector by averaging over the spatial dimensions,

\[ \mathrm{gap}(x)_c \; =\; \frac{1}{HW} \sum _{i, j} x_{c, i, j}, \]

so the classifier head sees one summary scalar per channel instead of the full \(C \cdot H \cdot W\) flattened activation. That’s the architectural move that drops ResNet-34 from what would have been a \(\sim \)200M parameter AlexNet-era spec down to 21.3M. We’ll see GAP again as the “Squeeze” step inside the squeeze-and-excitation block in Chapter 8, and at the head of every post-2015 vision architecture in the bestiary.

Results

Build and run:

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/resnet34-train
ResNet-34: 21289802 params
Generating train step MLIR...
  517912 chars
Compiling vmfbs...
  forward compiled
  eval forward (fixed BN) compiled
  compiled
  session loaded
  train: 9469 images (256×256)
  21289802 params + m + v (243 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 36, BN stat floats: 17024
  step 0/295: loss=3.581499 (1478ms)
Epoch  1/80: loss=1.812683 lr=0.000333 (423692ms)
Epoch  2/80: loss=1.512727 lr=0.000667 (423510ms)
Epoch  3/80: loss=1.393462 lr=0.001000 (425071ms)
...
Epoch 78/80: loss=0.502628 lr=0.000004 (426600ms)
Epoch 79/80: loss=0.502615 lr=0.000002 (427378ms)
Epoch 80/80: loss=0.502586 lr=0.000000 (427151ms)
  val accuracy (running BN): 3525/3904 = 90.29%

(Full log in logs/r34_bn.log.)

Eighty epochs, about 425 seconds per epoch on the 7900 XTX (\(\sim \)7 minutes), total wall time around 9.5 hours. Final test accuracy 90.29% on Imagenette — the going rate for scratch-trained ResNet-34 at this resolution and data size.

Some numbers worth noticing:

- 21.3M params, but only about 517 KB of MLIR (517 912 chars) for the entire training step — roughly 25\(\times \) the MLP, 24\(\times \) the no-BN CIFAR CNN. The network deepened by 10\(\times \) more layers, but the emitted compute only grew linearly in depth: IREE’s code generation is per-op, not per-parameter, so 3\(\times \)3 convs at different channel widths all emit similar-sized StableHLO.

- Per-step time is 1.4 s, dominated by the four .residualBlock stages (their \(3 \times 3\) convs at 64, 128, 256, 512 channels are where the FLOPs live). The first block alone at 64 channels does \(3 \times 3 \times 64 \times 64\) per output spatial location — about \(3.4 \times 10^8\) FLOPs per forward image, times the batch of 32, times three blocks at that width, times 295 batches per epoch, times two (forward and backward).

- Final loss plateaus near 0.50, not zero. That’s the label-smoothing floor: with \(\epsilon = 0.1\) and 10 classes, the minimum possible cross-entropy against a smoothed target is \(-0.9 \log 0.9 - 0.1 \log (0.1/9) \approx 0.485\). The 0.50 value at epoch 80 means the model has essentially converged; more epochs would mostly just extend training time without budging the loss.

- BN layers: 36. Every residual block contains two convBn layers, and there are 16 residual blocks across the four stages (3+4+6+3). Plus the stem convBn. That’s 33 from the residualBlock primitive expansions plus 3 stem/projection layers, for 36 total. Each one proves its own VJP via § 34 and composes with the rest of the network via § 9.

6.2 What’s in the production recipe?

Six ingredients first appear here, none of which are layers — they all live in TrainConfig or training code. They don’t touch the network, but they’re what moves a scratch-trained model from 72.82% (plain SGD + momentum, no other tricks) to 90.29%.

Adam (Kingma & Ba, 2014, arXiv:1412.6980). Replaces vanilla SGD. Maintains per-parameter running estimates of the first and second moments of the gradient and uses them for an adaptive per-parameter learning rate. Much faster convergence than SGD on most tasks, much more forgiving of learning-rate choice.

Cosine learning-rate schedule (Loshchilov & Hutter, 2016, arXiv:1608.03983). After warmup, the learning rate decays from its peak to near-zero following \(\mathrm{lr}(t) = \mathrm{lr}_{\max } \cdot \tfrac {1}{2}(1 + \cos (\pi t / T))\). Smoother than step schedules; consistently produces slightly better final losses in practice. Paired with warmup and progressive resizing (Howard et al., 2018), this was the winning recipe at Stanford’s DAWNBench in 2018; the lineage traces back to Leslie Smith’s cyclical learning rates (Smith 2015, arXiv:1506.01186). Popularized for ImageNet-scale training by He et al. 2018’s “Bag of Tricks” (arXiv:1812.01187).

Warmup (Goyal et al., 2017, arXiv:1706.02677). Linearly ramps the learning rate from zero to its peak over the first few epochs (3 for ResNet-34). Stabilizes early training when gradients are still reorganizing randomly-initialized weights. Becomes essential for larger models and larger batches — by ViT-Tiny in Chapter 10 it’s a necessity.

Weight decay. Adds a small L2 shrink to all trainable weights each step: \(w \leftarrow w - \eta \lambda w\). Regularizer, discourages weights from growing without bound. Loshchilov & Hutter’s “decoupled” variant (AdamW, 2017, arXiv:1711.05101) is what actually ships in modern pipelines, and it’s what TrainConfig.weightDecay implements.

Augmentation. Our pipeline does random crops (\(256 \to 224\)) and random horizontal flips on Imagenette; hflip only on CIFAR; nothing on MNIST. Lives in the data pipeline, not the network. We will explore data augmentation in chapters 9 and 10.

Label smoothing (Szegedy et al., 2016, arXiv:1512.00567). Replaces the one-hot target with a smoothed \((\varepsilon / (K-1), \ldots , 1 - \varepsilon , \ldots )\). Stops the network from producing arbitrarily large logits on the correct class, which keeps training stable and improves calibration. Explains the 0.50 loss floor in the Results section above: \(-0.9 \log 0.9 - 0.1 \log (0.1/9) \approx 0.485\).

Every chapter after this one uses the same six ingredients. The NetSpec changes; the recipe stays.

6.3 Ablation: what each ingredient contributes

So how much does each ingredient actually earn? We ran the full recipe plus six leave-one-out variants on ResNet-34 Imagenette, 80 epochs each, same init and batch order. Each ablation row keeps everything except the one named component, so the lift measures that component’s contribution given everything else is present — an honest answer to “is this piece pulling its weight?” rather than the order-dependent story you’d get from an additive ladder.

Run

Val accuracy

\(\Delta \) vs full

Full recipe

90.29%

Bare recipe (plain SGD + momentum, no other tricks)

72.82%

\(-\)17.47

Full minus basic augmentation

82.71%

\(-\)7.58

Full minus cosine decay

86.81%

\(-\)3.48

Full minus Adam (vanilla SGD, lr 0.01)

87.04%

\(-\)3.25

Full minus warmup

88.88%

\(-\)1.41

Full minus label smoothing

89.24%

\(-\)1.05

Full minus weight decay

89.68%

\(-\)0.61

Three observations worth naming:

Augmentation is by far the largest single contribution (\(-\)7.58 points, more than 2\(\times \) any other knob). Imagenette has just 9469 training images for a 21M-parameter network, so without augmentation the model overfits hard — training loss drops below \(0.001\) while val accuracy stalls. Augmentation is doing most of the work that the rest of the recipe gets credit for.

Cosine decay and Adam are roughly tied for largest non-augmentation contribution (\(-\)3.48 and \(-\)3.25 points). They operate on different axes — cosine shapes the learning-rate trajectory, Adam shapes the per-parameter update magnitude — but their late-training influence on val accuracy is comparable.

The contributions are essentially additive. Sum of all six leave-one-out deltas: \(-17.38\) points. Bare-recipe delta vs full: \(-17.47\) points. The two agree to within rounding, suggesting the modern recipe’s ingredients aren’t significantly interacting — each one buys roughly the same lift independently of which others are present. Weight decay and label smoothing are the smallest contributors (\(-\)0.61 and \(-\)1.05) but show the overfit-then-catch-up signature mid-training (e.g., no-WD lands \(-\)6 points at epoch 10 before the gap narrows by epoch 80).

A note on the SGD configs: r34-no-adam replaces Adam with vanilla SGD at lr=0.01 (no momentum), keeping everything else from the full recipe. r34-bare uses SGD with momentum 0.9 at lr=0.01 and turns all other tricks off. The two SGD configs therefore differ on momentum as well as on the recipe ingredients, so the bare row should be read as a “what does the modern recipe buy us total” baseline rather than a strict additive component of the leave-one-out chain.

6.4 ImageNet recipe

The Imagenette runs above all trained on \(\sim \)9.5K images for 80 epochs. The headline ResNet result — 21.8M parameters reaching 74.4% top-1 on full 1000-class ImageNet — comes from a different recipe: 1.28M training images, 90 epochs, paper-recipe SGD with momentum. The architecture is unchanged; only the dataset, the output head width, and the training schedule differ. Here’s what the loop looks like:

-- 1. Same backbone, 1000 classes instead of 10.
def resnet34Imagenet : NetSpec where
  name   := "ResNet-34 (ImageNet)"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 64 7 2 .same,
    .maxPool 3 2,
    .residualBlock  64  64 3 1,
    .residualBlock  64 128 4 2,
    .residualBlock 128 256 6 2,
    .residualBlock 256 512 3 2,
    .globalAvgPool,
    .dense 512 1000 .identity      -- 1000-class head
  ]

-- 2. Paper recipe (SGD + momentum) plus modern polish.
def resnet34ImagenetConfig : TrainConfig where
  learningRate := 0.1
  batchSize    := 256
  epochs       := 90
  useAdam      := false             -- paper: SGD + momentum
  momentum     := 0.9
  weightDecay  := 1e-4
  cosineDecay  := true              -- replaces paper's step decay at ep 30/60
  warmupEpochs := 5                 -- modern addition
  augment      := true              -- random-resized-crop + horizontal flip
  labelSmoothing := 0.1             -- modern addition

-- 3. Same .train entry point as every other chapter.
def main (args : List String) : IO Unit :=
  resnet34Imagenet.train resnet34ImagenetConfig
    (args.head?.getD "data/imagenet") .imagenet

Today, the phase-2 JAX path (jax/MainResnetImagenet.lean, runs via lake build resnet34-imagenet) wires .imagenet through tfds streaming. The phase-3 IREE path shown in the snippet above still needs a C-side streaming reader for the 1.28M-image training set (Imagenette’s 9.5K fits in a single ByteArray; ImageNet doesn’t) — until that lands, the phase-3 dispatcher panics with a “use phase 2” note when handed .imagenet. What’s worth noting is that everything else — the architecture, the .train invocation, the recipe knobs — is identical across both paths; only the data-loading layer differs. The framework isn’t doing anything fundamentally different at ImageNet scale; it’s just the data that’s bigger.

Compute budget. The phase-2 (Lean\(\to \)JAX) path trains resnet34Imagenet on two AMD Radeon RX 7900 XTX GPUs (consumer hardware: gfx1100, ROCm 7.2, JAX 0.10.0, multi-GPU sharding via Mesh + NamedSharding(P(’batch’)) with RCCL preloaded). At batch 256 (128 per device), one epoch over the 1.28M-image training set runs in about 24 minutes of compute plus a 10-second val pass; a 30-epoch run finishes in roughly 15 wall-clock hours including first-epoch JIT compile. Final result on the standard 50K-image validation split is \(\mathbf{69.26\% }\) top-1 (val_loss \(1.232\)). The parameter checkpoint (jax_r34_imagenet.bin, 87 MB, 37 param groups) is written via the same params_to_file emit that phase-3 IREE consumes — so the JAX-trained weights drop into a phase-3 TrainConfig.bootstrapBackbone field byte-for-byte without a separate conversion step.

GPU

Precision

Per epoch

30 epochs

Val top-1

Val top-5

2\(\times \) 7900 XTX (ROCm)

fp32

\(\sim \)24.2 min

\(\sim \)15 hr

\(\mathbf{69.26\% }\)

(No bf16 row here: on AMD’s MIOpen, bf16 convolution is no faster than fp32 — see the CUDA discussion below — so the ROCm run stays fp32.)

The recipe trims the paper’s 90 epochs to 30 (SGD with momentum \(0.9\), batch 256, cosine LR with a 5-epoch warmup from peak \(0.1\), label smoothing \(0.1\), Inception-style random-resized-crop (sampling \(8\)–\(100\% \) of the image area at a \(\frac{3}{4}\)–\(\frac{4}{3}\) aspect ratio, then resizing to \(224\times 224\)) plus a horizontal flip, \(\ell _2\) weight decay \(10^{-4}\)) — a deliberate trade for roughly \(95\% \) of the paper’s \(\sim \)73% headline at \(\sim \)1/3 the compute.

bf16 and the full 90-epoch run. Running the whole schedule is a question of patience, not capability — and bf16 mixed precision roughly halves the wait. On a second box (four usable NVIDIA RTX 4060 Ti, CUDA/cuDNN), the same resnet34Imagenet spec with bf16 := true casts both the matmuls and the convolutions to bfloat16 while keeping fp32 master weights. NVIDIA’s tensor cores run bf16 convolution about \(1.6\times \) faster than fp32 — the opposite of AMD’s MIOpen, where bf16 conv is a wash, which is why conv precision is gated behind its own bf16Conv flag (matmul-only bf16 helps transformers on either vendor; bf16 conv only pays off on cuDNN). The wall-clock difference over a full 90-epoch run, batch 256 across the four GPUs:

GPU

Precision

Per epoch

90 epochs

Val top-1

Val top-5

4\(\times \) 4060 Ti (CUDA)

fp32

\(\sim \)16.8 min

\(\sim \)25 hr

4\(\times \) 4060 Ti (CUDA)

bf16

\(\sim \)10.5 min

\(\sim \)16 hr

\(\mathbf{72.02\% }\)

\(\mathbf{90.62\% }\)

The bf16 run finished all 90 epochs in about 16 wall-clock hours and reached \(\mathbf{72.02\% }\) top-1 / \(\mathbf{90.62\% }\) top-5 on the full 50,000-image validation split — squarely on the paper’s ResNet-34 mark, in bf16, on consumer GPUs. (fp32 was measured at the matched \(1.6\times \)-slower throughput but not run to completion; bf16 does not cost final accuracy here.) The \(1.6\times \) figure tracks a microbenchmark of the bare conv operator at this network’s stage-1 shape almost exactly, so the speedup is the convolution itself going through tensor cores, not a scheduling artifact.

The validation curve over the run tells the usual story: a fast climb through the warmup and high-LR phase, a long shallow middle, and a final lift as the cosine schedule anneals the learning rate to zero.

\begin{tikzpicture} 
\begin{axis}[
    width=0.92\linewidth, height=6.5cm,
    xlabel={Epoch}, ylabel={Validation accuracy (\%)},
    xmin=0, xmax=91, ymin=10, ymax=95,
    xtick={0,15,30,45,60,75,90},
    ytick={20,40,60,80},
    legend pos=south east,
    legend cell align={left},
    grid=major, grid style={gray!18},
    tick label style={font=\small},
    label style={font=\small},
    every axis plot/.append style={line width=1pt, mark size=1pt},
]
\addplot[blue, mark=*, mark options={fill=blue}] coordinates {
(1,15.25) (2,29.16) (3,37.00) (4,40.99) (5,42.79) (6,45.70) (7,47.22) (8,48.63) (9,48.70) (10,49.19) (11,50.36) (12,50.45) (14,51.33) (15,51.84) (16,51.97) (17,51.84) (18,52.52) (20,53.23) (21,53.34) (22,53.54) (23,53.40) (24,54.19) (25,54.15) (26,54.17) (27,54.08) (28,54.70) (29,55.17) (31,55.06) (32,55.37) (33,55.45) (34,55.42) (35,56.37) (37,56.53) (38,56.95) (39,56.44) (40,57.35) (41,57.19) (43,57.82) (44,58.03) (45,58.35) (46,58.76) (47,58.92) (49,59.34) (50,59.32) (51,60.14) (52,59.88) (53,60.53) (55,61.14) (56,61.24) (57,61.68) (58,61.69) (59,62.77) (61,63.26) (62,63.25) (63,63.92) (64,64.59) (65,64.50) (67,65.08) (68,65.05) (69,65.89) (70,66.35) (71,66.85) (73,67.43) (74,68.00) (75,68.11) (76,68.68) (77,69.27) (78,69.50) (79,69.81) (80,70.28) (81,70.40) (82,70.94) (83,71.15) (85,71.60) (86,71.93) (87,71.98) (88,72.12) (89,72.06) (90,72.08)
};
\addlegendentry{top-1}
\addplot[orange, mark=*, mark options={fill=orange}] coordinates {
(1,35.44) (2,55.47) (3,64.13) (4,68.26) (5,69.58) (6,72.41) (7,73.63) (8,74.74) (9,74.88) (10,75.76) (11,76.57) (12,76.57) (14,77.24) (15,77.50) (16,77.78) (17,77.42) (18,78.07) (20,78.76) (21,78.82) (22,78.49) (23,78.66) (24,79.17) (25,79.31) (26,79.23) (27,79.40) (28,79.53) (29,80.24) (31,79.79) (32,80.07) (33,80.14) (34,80.43) (35,81.07) (37,81.00) (38,81.18) (39,81.11) (40,81.63) (41,81.57) (43,81.84) (44,82.01) (45,82.25) (46,82.64) (47,82.93) (49,82.97) (50,83.02) (51,83.49) (52,83.27) (53,83.70) (55,84.19) (56,84.34) (57,84.45) (58,84.50) (59,85.24) (61,85.38) (62,85.47) (63,85.86) (64,85.98) (65,86.42) (67,86.78) (68,86.72) (69,87.29) (70,87.52) (71,87.68) (73,88.30) (74,88.32) (75,88.33) (76,88.74) (77,89.03) (78,89.28) (79,89.34) (80,89.60) (81,89.71) (82,89.86) (83,90.14) (85,90.34) (86,90.48) (87,90.61) (88,90.61) (89,90.66) (90,90.63)
};
\addlegendentry{top-5}
\end{axis}
\end{tikzpicture}

ResNet-34 / ImageNet-1k validation accuracy per epoch (bf16, 4\(\times \) 4060 Ti). The kink near epoch 5 is the end of warmup; the steady late climb is the cosine anneal.

A reasonable stopping point

If you’re skimming this book to decide whether the framework delivers — does Lean actually verify modern deep learning end to end? — Chapter 6 is the answer. ResNet-34 has 21M parameters, 36 residual blocks, batch normalization, the full production training recipe, and it trains to 90.29% val on Imagenette through MLIR generated by the same Lean code that proves the gradients. Every operator’s backward pass is formalized; every chain-rule step through the architecture is type-checked.

For the fastest path through this book, the route is roughly:

  • Ch 1–2: framework setup (mandatory context).

  • Ch 3–4: MLP and CNN worked examples — the simplest cases of the framework reaching real architectures.

  • Skip Ch 5 (BatchNorm). It’s structurally the hardest chapter and the prose says so up front. The math is right (pdiv_bnNormalize’s three-term cancellation is formalized and matches the codegen), but it’s not the place to spend time if you’re not specifically interested in BN.

  • Read this chapter (Ch 6, ResNet-34). Residual blocks are where the chain rule shows it can carry composition through arbitrarily deep architectures; the proof reduces to one additive-fan-in lemma you’ve already seen.

  • Stop here. Assume Chapters 710 (MobileNet, EfficientNet, ConvNeXt, ViT) and the Bestiary (Ch 11) are correct — they all reuse the same framework with new operators slotting into the same proof tree. Appendix B walks through how that claim is auditable if you ever want to spot-check it.

Reading through Ch 6 is enough to know the framework works.

Or, for breadth.

If you want to cover most of modern deep learning instead of stopping short, the fuller path is to keep reading through Ch 10 (ViT) for the full image-classification arc and then run the trainers in demos/:

lake exe yolov1-voc-train-bootstrap  # detection (YOLO on Pascal VOC)
lake exe unet-pets-train         # segmentation (UNet on Pets)
lake exe cifar-ddpm-train        # diffusion generative models
lake exe gradcam                 # explainability
lake exe tinygpt-shakespeare train   # NLP (char-level transformer)

The arc from ViT to TinyGPT in particular reuses the attention proofs from Ch 10 verbatim — image-token attention and character-token attention are the same operator, just at different token alphabets. Image classification + detection + segmentation + diffusion + NLP, all on the same Lean\(\to \)MLIR\(\to \)IREE pipeline, covers most of what people mean by “modern deep learning.”