5 ResNet-34

Deeper networks used to stop training

Before 2015, the dominant assumption in image classification was “deeper is better, up to a point.” The “up to a point” part was not metaphorical. If you took a 20-layer CNN that trained to 70% val accuracy and added 12 more convolutional layers, you didn’t get a 32-layer model with 72% accuracy. You got a 32-layer model that, after the same training budget, fit the training set worse than the 20-layer one did. The 32-layer model had strictly more capacity. It still trained worse. Something about deep stacks of convolutions made gradient flow degrade in a way that the optimizer couldn’t navigate around.

He et al. 2015 (arXiv:1512.03385) had the right observation: if the 12 extra layers in the deeper network were initialized to compute the identity, the 32-layer network would have at least the 20-layer’s performance, because it could replicate it exactly. So the question was not “can a deep network represent what a shallow one represents?”—it provably could. The question was: why is optimizing 12 free-form layers to recover identity so much harder than just leaving identity there?

The residual block: ask for the difference, not the function

Their fix was a one-line architectural change. Instead of asking each block to learn a function $y = f(x)$, ask it to learn a function $y = f(x) + x$. The block’s job is no longer to become the new representation. Its job is to learn what to add to the current representation. At initialization, with randomly-small $f$, each block is approximately the identity, and a stack of 16 such blocks is approximately a stack of 16 identities—which is to say, approximately the input. The network starts out near-identity at every depth and only gradually takes on structure as $f$ trains. The optimization problem becomes “what residual should each block add” instead of “what function should each block be,” and the former turns out to be dramatically easier.

The change to the architecture is one + sign. The change to the training dynamics is the difference between “18 layers is about as deep as we can go” and “152 layers trains stably; let us also add another zero on the ImageNet leaderboard.”

Why the proof is a one-liner

From the framework’s perspective, what is a residual block? $\mathrm{residual}(x) = f(x) + \mathrm{id}(x)$—a sum of two functions sharing the same input. We already proved in Chapter 1 that the VJP of an additive fan-in is “send the upstream gradient through both branches and add.” We already proved that the VJP of $\mathrm{id}$ is the identity on the gradient (and that the VJP of $f$ is whatever the inner block’s machinery gives us). Composition is just $\mathrm{vjp\_ comp}$. So the residual VJP is the additive fan-in lemma, with $\mathrm{id}$ plugged into one slot.

That is Theorem 37 below, and its proof is one line. This chapter has exactly one theorem, and that theorem is mechanical. The engineering revolution was not in the math—it was in the realization that this operation was the missing piece for training deep networks. The fact that the framework absorbs the new architecture without any new math is the value proposition: once the foundation rules are in place, the next decade of image recognition follows from composing them.

ResNet-34 is sixteen of these blocks

ResNet-34 (the smallest “deep” variant in the original paper) is built from 16 residual blocks arranged in four stages of depths 3, 4, 6, 3, at channel widths 64, 128, 256, 512. Each stage starts with a stride-2 downsample (halving spatial resolution) before its stack of stride-1 blocks. The stages are glued together by a stem (a 7$\times $7 conv at stride 2 plus a max-pool, taking 224$\times $224$\times $3 input to 56$\times $56$\times $64 features) and a head (global average pool plus a 512-to-10 dense layer). Total: 34 weight-bearing layers, 21.3M parameters.

Two pieces of this template are new and worth naming:

The 7$\times $7 stride-2 stem. Earlier chapters’ networks operated at full input resolution; ResNet aggressively downsamples at the input so the bulk of compute happens at lower spatial resolution. The stem is one big conv that gets us from 224$\times $224 to 56$\times $56 in a single step.
Global average pool instead of flatten-plus-dense. Chapters 3 and 4 ended in a .flatten followed by a $\sim $4096-to-512 dense layer that contained tens of millions of parameters. ResNet collapses the final 7$\times $7$\times $512 feature map to a 512-vector by spatial averaging, then runs a single 512-to-10 dense. No flatten layer. The architectural move alone drops the model from what would have been a $\sim $200M-parameter AlexNet-era spec down to 21.3M.

Every chapter after this one is a variation on this template. ConvNeXt swaps in different blocks. MobileNetV2 swaps in depthwise-separable blocks. EfficientNet adds compound scaling. ViT replaces the whole convolutional spine with patches and attention. The stem-stages-head template, and the residual fan-in pattern, persist.

5.1 The theorem

Theorem 37 Residual block VJP

✓

assume:

$B_f$ is a correct backward function for $f$ ($\mathsf{HasVJP}\, f$) [hf]
$f$ is differentiable everywhere [hf_diff]

prove: $\mathsf{HasVJP}\, (\mathrm{residual}\, f)$, where $\mathrm{residual}\, f\, x = f(x) + x$, with backward $B(x, dy) = B_f(x, dy) + dy$.

Proof ▶

$\mathrm{residual}\, f = \mathrm{biPath}\, f\, \mathrm{id}$.
proof: Definitional.
q.e.d.
proof: Instantiate the additive fan-in VJP (Theorem 11) at $g = \mathrm{id}$ (differentiable; its VJP is Theorem 13), with assumptions 1 and 2 supplying the $f$ side. The identity’s backward contributes $dy$ itself, so the composed backward is $B_f(x, dy) + dy$ — the gradient floor that makes ResNets trainable.

5.2 Example: ResNet-34 on Imagenette

Same template as the MNIST and CIFAR chapters, but at the scale the architecture was designed for. ResNet-34 on 224$\times $224 Imagenette — 10 classes, real photographs, the transition point from “classroom dataset” to “actual image recognition.”

The residual block itself is the whole trick: instead of $y = f(x)$, compute $y = f(x) + x$. That’s the additive-fan-in pattern of § 11 with $\mathrm{id}$ on one branch, which is the composition § 37 proves. Stacking 34 layers worth of these blocks just composes the fan-in rule 34 times (16 residual blocks deep and 34 convs across all of them).

The architecture

The same vertical column as the previous chapters, now at production scale: a $7 \times 7$ stride-2 stem and a max-pool drop the $224 \times 224 \times 3$ input to $56 \times 56 \times 64$, then four stages of residual blocks (depths $3, 4, 6, 3$ at widths $64, 128, 256, 512$) do the work, and a global average pool replaces the flatten-plus-fat-dense head. The inset shows the one new piece: each residual block adds its input back, $y = f(x) + x$.

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.58cm, minimum width=5.4cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, convbn/.style = {col, draw=orange!65!black, fill=orange!12}, pool/.style = {col, draw=teal!60!black, fill=teal!10}, resid/.style = {col, draw=violet!70!black, fill=violet!9, minimum height=0.66cm}, gap/.style = {col, draw=purple!60!black, fill=purple!8}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, stage/.style = {font=\sffamily\scriptsize\itshape, gray!55!black, anchor=west}, ] \node[io] (input) {Input \;\; $224\times224\times3$}; \node[convbn, below=0.18cm of input] (stem) {\textbf{ConvBN} stem $3\to64$, $7\times7$, /2}; \node[pool, below=0.18cm of stem] (mp) {\textbf{maxPool} $2\times2$ \;\; $112\to56$}; \node[resid, below=0.18cm of mp] (s1) {$3\times$ \textbf{Residual} $64\to64$ \;\; $56\times56$}; \node[resid, below=0.18cm of s1] (s2) {$4\times$ \textbf{Residual} $64\to128$, /2 \;\; $28\times28$}; \node[resid, below=0.18cm of s2] (s3) {$6\times$ \textbf{Residual} $128\to256$, /2 \;\; $14\times14$}; \node[resid, below=0.18cm of s3] (s4) {$3\times$ \textbf{Residual} $256\to512$, /2 \;\; $7\times7$}; \node[gap, below=0.18cm of s4] (g) {global avg pool \;\; $7\times7\times512 \to 512$}; \node[head, below=0.18cm of g] (d) {\textbf{Dense} $512\to10$ \;(identity)}; \node[logits, below=0.18cm of d] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/stem, stem/mp, mp/s1, s1/s2, s2/s3, s3/s4, s4/g, g/d, d/out} \draw[arr] (\a) -- (\b); \node[stage] at ($(s1.east) + (0.30,0)$) {stage 1}; \node[stage] at ($(s2.east) + (0.30,0)$) {stage 2}; \node[stage] at ($(s3.east) + (0.30,0)$) {stage 3}; \node[stage] at ($(s4.east) + (0.30,0)$) {stage 4}; \end{tikzpicture}$

$\begin{tikzpicture} [ >={Stealth[length=1.6mm]}, every node/.style={font=\sffamily\scriptsize}, box/.style = {align=center, rounded corners=2pt, inner sep=3pt, minimum height=0.62cm, draw=violet!70!black, fill=violet!9}, dot/.style = {circle, draw=violet!70!black, fill=violet!12, inner sep=0pt, minimum size=0.42cm}, term/.style = {font=\sffamily\scriptsize\itshape, inner sep=1pt}, arr/.style = {->, thick, gray!65, shorten >=1pt, shorten <=1pt}, skip/.style = {->, thick, violet!70!black, shorten >=1pt}, ] \node[term] (x) {$x$}; \node[box, right=0.55cm of x] (f1) {ConvBN, ReLU}; \node[box, right=0.45cm of f1] (f2) {ConvBN}; \node[dot, right=0.55cm of f2] (sum) {$+$}; \node[term, right=0.55cm of sum] (y) {ReLU $\to y$}; \draw[arr] (x) -- (f1); \draw[arr] (f1) -- (f2); \draw[arr] (f2) -- (sum); \draw[arr] (sum) -- (y); \draw[skip] (x.north) .. controls +(0,0.75) and +(0,0.75) .. (sum.north) node[midway, above, term, violet!70!black] {identity skip ($+\,x$)}; \node[term, below=0.18cm of f2, gray!55!black] {one residual block: \; $y = f(x) + x$}; \end{tikzpicture}$

-- 1
import LeanMlir

-- 2
def resnet34 : NetSpec where
  name   := "ResNet-34"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 64 7 2 .same,         -- stem
    .maxPool 2 2,
    .residualBlock  64  64 3 1,     -- stage 1 — 3 blocks
    .residualBlock  64 128 4 2,     -- stage 2 — 4 blocks
    .residualBlock 128 256 6 2,     -- stage 3 — 6 blocks
    .residualBlock 256 512 3 2,     -- stage 4 — 3 blocks
    .globalAvgPool,                 -- replaces flatten + fat dense
    .dense 512 10 .identity
  ]

-- 3
def resnet34Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  resnet34.train resnet34Config (args.head?.getD "data/imagenette")

First chapter that uses the full production recipe (Adam + cosine + warmup + weight decay + augmentation + label smoothing) rather than the s4tfBaseline. Also the first chapter where .globalAvgPool replaces the flatten-plus-fat-dense head. Global average pooling (GAP) collapses each $C \times H \times W$ feature map to a length-$C$ vector by averaging over the spatial dimensions,

\[ \mathrm{gap}(x)_c \; =\; \frac{1}{HW} \sum _{i, j} x_{c, i, j}, \]

so the classifier head sees one summary scalar per channel instead of the full $C \cdot H \cdot W$ flattened activation. That’s the architectural move that drops ResNet-34 from what would have been a $\sim $200M parameter AlexNet-era spec down to 21.3M. We’ll see GAP again as the “Squeeze” step inside the squeeze-and-excitation block in Chapter 7, and at the head of every post-2015 vision architecture in the bestiary.

Results

Build and run:

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/resnet34-train
ResNet-34: 21289802 params
Generating train step MLIR...
  517912 chars
Compiling vmfbs...
  forward compiled
  eval forward (fixed BN) compiled
  compiled
  session loaded
  train: 9469 images (256×256)
  21289802 params + m + v (243 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 36, BN stat floats: 17024
  step 0/295: loss=3.581499 (1478ms)
Epoch  1/80: loss=1.812683 lr=0.000333 (423692ms)
Epoch  2/80: loss=1.512727 lr=0.000667 (423510ms)
Epoch  3/80: loss=1.393462 lr=0.001000 (425071ms)
...
Epoch 78/80: loss=0.502628 lr=0.000004 (426600ms)
Epoch 79/80: loss=0.502615 lr=0.000002 (427378ms)
Epoch 80/80: loss=0.502586 lr=0.000000 (427151ms)
  val accuracy (running BN): 3525/3904 = 90.29%

(Full log in logs/r34_bn.log.)

Eighty epochs, about 425 seconds per epoch on the 7900 XTX ($\sim $7 minutes), total wall time around 9.5 hours. Final test accuracy 90.29% on Imagenette — the going rate for scratch-trained ResNet-34 at this resolution and data size.

Some numbers worth noticing:

- 21.3M params, but only about 517 KB of MLIR (517 912 chars) for the entire training step — roughly 25$\times $ the MLP, 24$\times $ the no-BN CIFAR CNN. The network deepened by 10$\times $ more layers, but the emitted compute only grew linearly in depth: IREE’s code generation is per-op, not per-parameter, so 3$\times $3 convs at different channel widths all emit similar-sized StableHLO.

- Per-step time is 1.4 s, dominated by the four .residualBlock stages (their $3 \times 3$ convs at 64, 128, 256, 512 channels are where the FLOPs live). The first block alone at 64 channels does $3 \times 3 \times 64 \times 64$ per output spatial location — about $3.4 \times 10^8$ FLOPs per forward image, times the batch of 32, times three blocks at that width, times 295 batches per epoch, times two (forward and backward).

- Final loss plateaus near 0.50, not zero. That’s the label-smoothing floor: with $\epsilon = 0.1$ and 10 classes, the minimum possible cross-entropy against a smoothed target is $-0.9 \log 0.9 - 0.1 \log (0.1/9) \approx 0.485$. The 0.50 value at epoch 80 means the model has essentially converged; more epochs would mostly just extend training time without budging the loss.

- BN layers: 36. Every residual block contains two convBn layers, and there are 16 residual blocks across the four stages (3+4+6+3). Plus the stem convBn. That’s 33 from the residualBlock primitive expansions plus 3 stem/projection layers, for 36 total. Each one proves its own VJP via § 36 and composes with the rest of the network via § 10.

5.3 MLIR: Residual

What is already proven. A residual block is $\mathrm{residual}(x) = f(x) + x$ — an additive fan-in, where $f$ is the conv $\to $ BN $\to $ relu $\to $ conv $\to $ BN inner block. As Theorem 37 states, its reverse-mode derivative is the additive-fan-in rule with the identity in one slot: send the upstream gradient through both branches and add. residual_has_vjp_at proves exactly this at the evaluation point (a one-line proof: the fan-in lemma with the identity’s VJP plugged into one slot), and the inner $f$’s VJP is the convolution and BatchNorm backward already proven in Chapters 3 and 4. The whole block — and a tower of sixteen of them — composes by the chain rule vjp_comp_at (§ 10).

The gap and how we close it. The residual fan-in is the structural heart of a deep ResNet. The emitted backward is given a denotation valued in the proofs’ own tensor type, shown equal to residual_has_vjp_at’s backward, and the structural move is visible in the emitted text. The forward saves $\texttt{add} = f(x) + x$ and the block output $\texttt{relu}(\texttt{add})$; the backward first pushes the incoming cotangent back through that relu (a compare/select at the post-add, giving %dadd), and then — this is the fan-in — sends %dadd through both branches and adds (the inner block’s forward and $f$-backward elided, as they are the conv/BN nodes of Chapters 3–4):

// forward (saved): %out = f(%x)  [conv-BN-relu-conv-BN, Ch. 4-5]
//                  %add = %out + %x
%madd = stablehlo.compare GT, %add, %zc
          : (tensor<1x2x4x4xf32>, tensor<1x2x4x4xf32>)
            -> tensor<1x2x4x4xi1>
%dadd = stablehlo.select %madd, %dOut, %zc
          : tensor<1x2x4x4xi1>, tensor<1x2x4x4xf32>
// %dF = f.backward(%dadd)  [conv/BN/relu input-VJP chain, Ch. 4-5]
%dx = stablehlo.add %dF, %dadd : tensor<1x2x4x4xf32>
return %dx : tensor<1x2x4x4xf32>

Read it against the theorem: %dadd is the cotangent at the fan-in, and it appears twice in the last two lines — once as the input to $f$’s backward (%dF, the gradient through the residual function) and once added directly (%dx = %dF + %dadd, the gradient straight down the identity skip). That single stablehlo.add is “send the gradient through both branches and add”: the skip term is the same %dadd the block term started from, which is why the gradient survives all sixteen blocks instead of attenuating away. The inner $f$ backward that computes %dF — the convolution and BatchNorm input-VJPs — is bridged op by op exactly as in Chapters 3 and 4; only the fan-in is new here.

Caveats.

The fan-in add is unconditional — addition is linear, so that bridge holds at every input.
The post-add ReLU is a smooth-point bridge (no pre-activation exactly zero), and the two BatchNorms inside $f$ carry their own $\epsilon {\gt} 0$ conditions from Chapter 4.
Representative scale — a two-channel block at $4\times 4$; the same %dadd fan-in repeats for all sixteen blocks.

5.4 What’s in the production recipe?

Six ingredients first appear here, none of which are layers — they all live in TrainConfig or training code. They don’t touch the network, but they’re what moves a scratch-trained model from 72.82% (plain SGD + momentum, no other tricks) to 90.29%.

Adam (Kingma & Ba, 2014, arXiv:1412.6980). Replaces vanilla SGD. Maintains per-parameter running estimates of the first and second moments of the gradient and uses them for an adaptive per-parameter learning rate. Much faster convergence than SGD on most tasks, much more forgiving of learning-rate choice.

Cosine learning-rate schedule (Loshchilov & Hutter, 2016, arXiv:1608.03983). After warmup, the learning rate decays from its peak to near-zero following $\mathrm{lr}(t) = \mathrm{lr}_{\max } \cdot \tfrac {1}{2}(1 + \cos (\pi t / T))$. Smoother than step schedules; consistently produces slightly better final losses in practice. Paired with warmup and progressive resizing (Howard et al., 2018), this was the winning recipe at Stanford’s DAWNBench in 2018; the lineage traces back to Leslie Smith’s cyclical learning rates (Smith 2015, arXiv:1506.01186). Popularized for ImageNet-scale training by He et al. 2018’s “Bag of Tricks” (arXiv:1812.01187).

Warmup (Goyal et al., 2017, arXiv:1706.02677). Linearly ramps the learning rate from zero to its peak over the first few epochs (3 for ResNet-34). Stabilizes early training when gradients are still reorganizing randomly-initialized weights. Becomes essential for larger models and larger batches — by ViT-Tiny in Chapter 9 it’s a necessity.

Weight decay. Adds a small L2 shrink to all trainable weights each step: $w \leftarrow w - \eta \lambda w$. Regularizer, discourages weights from growing without bound. Loshchilov & Hutter’s “decoupled” variant (AdamW, 2017, arXiv:1711.05101) is what actually ships in modern pipelines, and it’s what TrainConfig.weightDecay implements.

Augmentation. Our pipeline does random crops ($256 \to 224$) and random horizontal flips on Imagenette; hflip only on CIFAR; nothing on MNIST. Lives in the data pipeline, not the network. We will explore data augmentation strategies in upcoming chapters.

Label smoothing (Szegedy et al., 2016, arXiv:1512.00567). Replaces the one-hot target with a smoothed $(\varepsilon / (K-1), \ldots , 1 - \varepsilon , \ldots )$. Stops the network from producing arbitrarily large logits on the correct class, which keeps training stable and improves calibration. Explains the 0.50 loss floor in the Results section above: $-0.9 \log 0.9 - 0.1 \log (0.1/9) \approx 0.485$.

Every chapter after this one uses the same six ingredients. The NetSpec changes; the recipe stays.

5.5 Ablation: what each ingredient contributes

So how much does each ingredient actually earn? We ran the full recipe plus six leave-one-out variants on ResNet-34 Imagenette, 80 epochs each, same init and batch order. Each ablation row keeps everything except the one named component, so the lift measures that component’s contribution given everything else is present — an honest answer to “is this piece pulling its weight?” rather than the order-dependent story you’d get from an additive ladder.

Run	Val accuracy	$\Delta $ vs full
Full recipe	90.29%	—
Bare recipe (plain SGD + momentum, no other tricks)	72.82%	$-$17.47
Full minus basic augmentation	82.71%	$-$7.58
Full minus cosine decay	86.81%	$-$3.48
Full minus Adam (vanilla SGD, lr 0.01)	87.04%	$-$3.25
Full minus warmup	88.88%	$-$1.41
Full minus label smoothing	89.24%	$-$1.05
Full minus weight decay	89.68%	$-$0.61

Three observations worth naming:

Augmentation is by far the largest single contribution ($-$7.58 points, more than 2$\times $ any other knob). Imagenette has just 9469 training images for a 21M-parameter network, so without augmentation the model overfits hard — training loss drops below $0.001$ while val accuracy stalls. Augmentation is doing most of the work that the rest of the recipe gets credit for.

Cosine decay and Adam are roughly tied for largest non-augmentation contribution ($-$3.48 and $-$3.25 points). They operate on different axes — cosine shapes the learning-rate trajectory, Adam shapes the per-parameter update magnitude — but their late-training influence on val accuracy is comparable.

The contributions are essentially additive. Sum of all six leave-one-out deltas: $-17.38$ points. Bare-recipe delta vs full: $-17.47$ points. The two agree to within rounding, suggesting the modern recipe’s ingredients aren’t significantly interacting — each one buys roughly the same lift independently of which others are present. Weight decay and label smoothing are the smallest contributors ($-$0.61 and $-$1.05) but show the overfit-then-catch-up signature mid-training (e.g., no-WD lands $-$6 points at epoch 10 before the gap narrows by epoch 80).

A note on the SGD configs: r34-no-adam replaces Adam with vanilla SGD at lr=0.01 (no momentum), keeping everything else from the full recipe. r34-bare uses SGD with momentum 0.9 at lr=0.01 and turns all other tricks off. The two SGD configs therefore differ on momentum as well as on the recipe ingredients, so the bare row should be read as a “what does the modern recipe buy us total” baseline rather than a strict additive component of the leave-one-out chain.

5.6 ImageNet recipe

Why the phase-2 trainer? The ImageNet runs in this book use the phase-2 (Lean$\to $JAX) trainer: at this scale its job is to validate the framework’s logic end to end. Whether the phase-3 verified-IREE codegen can reach ImageNet under its own codegen rules is an open question — phase-2 is how we establish these baselines in the meantime.

The Imagenette runs above all trained on $\sim $9.5K images for 80 epochs. The headline ResNet result — 21.8M parameters reaching 74.4% top-1 on full 1000-class ImageNet — comes from a different recipe: 1.28M training images, 90 epochs, paper-recipe SGD with momentum. The architecture is unchanged; only the dataset, the output head width, and the training schedule differ. Here’s what the loop looks like:

-- 1. Same backbone, 1000 classes instead of 10.
def resnet34Imagenet : NetSpec where
  name   := "ResNet-34 (ImageNet)"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 64 7 2 .same,
    .maxPool 3 2,
    .residualBlock  64  64 3 1,
    .residualBlock  64 128 4 2,
    .residualBlock 128 256 6 2,
    .residualBlock 256 512 3 2,
    .globalAvgPool,
    .dense 512 1000 .identity      -- 1000-class head
  ]

-- 2. Paper recipe (SGD + momentum) plus modern polish.
def resnet34ImagenetConfig : TrainConfig where
  learningRate := 0.1
  batchSize    := 256
  epochs       := 90
  useAdam      := false             -- paper: SGD + momentum
  momentum     := 0.9
  weightDecay  := 1e-4
  cosineDecay  := true              -- replaces paper's step decay at ep 30/60
  warmupEpochs := 5                 -- modern addition
  augment      := true              -- random-resized-crop + horizontal flip
  labelSmoothing := 0.1             -- modern addition

-- 3. Same .train entry point as every other chapter.
def main (args : List String) : IO Unit :=
  resnet34Imagenet.train resnet34ImagenetConfig
    (args.head?.getD "data/imagenet") .imagenet

Today, the phase-2 JAX path (jax/MainResnetImagenet.lean, runs via lake build resnet34-imagenet) wires .imagenet through tfds streaming. The phase-3 IREE path shown in the snippet above still needs a C-side streaming reader for the 1.28M-image training set (Imagenette’s 9.5K fits in a single ByteArray; ImageNet doesn’t) — until that lands, the phase-3 dispatcher panics with a “use phase 2” note when handed .imagenet. What’s worth noting is that everything else — the architecture, the .train invocation, the recipe knobs — is identical across both paths; only the data-loading layer differs. The framework isn’t doing anything fundamentally different at ImageNet scale; it’s just the data that’s bigger.

Compute budget. The phase-2 (Lean$\to $JAX) path trains resnet34Imagenet on two AMD Radeon RX 7900 XTX GPUs (consumer hardware: gfx1100, ROCm 7.2, JAX 0.10.0, multi-GPU sharding via Mesh + NamedSharding(P(’batch’)) with RCCL preloaded). At batch 256 (128 per device), one epoch over the 1.28M-image training set runs in about 24 minutes of compute plus a 10-second val pass; a 30-epoch run finishes in roughly 15 wall-clock hours including first-epoch JIT compile. Final result on the standard 50K-image validation split is $\mathbf{69.26\% }$ top-1 (val_loss $1.232$). The parameter checkpoint (jax_r34_imagenet.bin, 87 MB, 37 param groups) is written via the same params_to_file emit that phase-3 IREE consumes — so the JAX-trained weights drop into a phase-3 TrainConfig.bootstrapBackbone field byte-for-byte without a separate conversion step.

The recipe trims the paper’s 90 epochs to 30 (SGD with momentum $0.9$, batch 256, cosine LR with a 5-epoch warmup from peak $0.1$, label smoothing $0.1$, Inception-style random-resized-crop (sampling $8$–$100\% $ of the image area at a $\frac{3}{4}$–$\frac{4}{3}$ aspect ratio, then resizing to $224\times 224$) plus a horizontal flip, $\ell _2$ weight decay $10^{-4}$) — a deliberate trade for roughly $95\% $ of the paper’s $\sim $73% headline at $\sim $1/3 the compute.

bf16 and the full 90-epoch run. Bf16 mixed precision roughly halves the wait. On a second box (four usable NVIDIA RTX 4060 Ti, CUDA/cuDNN), the same resnet34Imagenet spec with bf16 := true casts both the matmuls and the convolutions to bfloat16 while keeping fp32 master weights. NVIDIA’s tensor cores run bf16 convolution about $1.6\times $ faster than fp32 — the opposite of AMD’s MIOpen, where bf16 conv is a wash, which is why conv precision is gated behind its own bf16Conv flag (matmul-only bf16 helps transformers on either vendor; bf16 conv only pays off on cuDNN). The wall-clock difference over a full 90-epoch run, batch 256 across the four GPUs:

GPU	Precision	Per epoch	Epochs	Total	Val top-1	Val top-5
2$\times $ 7900 XTX (ROCm)	fp32	$\sim $24.2 min	30	$\sim $15 hr	$\mathbf{69.26\% }$	—
4$\times $ 4060 Ti (CUDA)	fp32	$\sim $16.8 min	90	$\sim $25 hr	—	—
4$\times $ 4060 Ti (CUDA)	bf16	$\sim $10.5 min	90	$\sim $16 hr	$\mathbf{72.02\% }$	$\mathbf{90.62\% }$
1$\times $ A100 (CUDA)	bf16	—	90	$\sim $9 hr	—	—
1$\times $ MI300X (ROCm)	bf16	$\sim $3.5 min	90	$\sim $5.4 hr	$\mathbf{73.95\% }$^†	$\mathbf{91.76\% }$

ResNet-34 / ImageNet-1k: per-epoch throughput and final validation for the completed runs (ROCm at 30 epochs, CUDA at the full 90; the MI300X row is a single-GPU batch-1024 run on gfx942). The A100 row is a wall-clock estimate, accuracies pending. ^† MI300X used batch 1024 / lr 0.4 (linear large-batch scaling) vs batch 256 / lr 0.1 for the other rows, so its $+1.9$-point top-1 over the 4060 Ti bf16 row is the large-batch scaling, not the hardware; a matched batch-256 run for an apples-to-apples hardware comparison is still to do.

The bf16 run finished all 90 epochs in about 16 wall-clock hours and reached $\mathbf{72.02\% }$ top-1 / $\mathbf{90.62\% }$ top-5 on the full 50,000-image validation split — squarely on the paper’s ResNet-34 mark, in bf16, on consumer GPUs. (fp32 was measured at the matched $1.6\times $-slower throughput but not run to completion; bf16 does not cost final accuracy here.) The $1.6\times $ figure tracks a microbenchmark of the bare conv operator at this network’s stage-1 shape almost exactly, so the speedup is the convolution itself going through tensor cores, not a scheduling artifact.

The validation curve over the run tells the usual story: a fast climb through the warmup and high-LR phase, a long shallow middle, and a final lift as the cosine schedule anneals the learning rate to zero.

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Validation accuracy (\%)}, xmin=0, xmax=91, ymin=10, ymax=95, xtick={0,15,30,45,60,75,90}, ytick={20,40,60,80}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (1,15.25) (2,29.16) (3,37.00) (4,40.99) (5,42.79) (6,45.70) (7,47.22) (8,48.63) (9,48.70) (10,49.19) (11,50.36) (12,50.45) (14,51.33) (15,51.84) (16,51.97) (17,51.84) (18,52.52) (20,53.23) (21,53.34) (22,53.54) (23,53.40) (24,54.19) (25,54.15) (26,54.17) (27,54.08) (28,54.70) (29,55.17) (31,55.06) (32,55.37) (33,55.45) (34,55.42) (35,56.37) (37,56.53) (38,56.95) (39,56.44) (40,57.35) (41,57.19) (43,57.82) (44,58.03) (45,58.35) (46,58.76) (47,58.92) (49,59.34) (50,59.32) (51,60.14) (52,59.88) (53,60.53) (55,61.14) (56,61.24) (57,61.68) (58,61.69) (59,62.77) (61,63.26) (62,63.25) (63,63.92) (64,64.59) (65,64.50) (67,65.08) (68,65.05) (69,65.89) (70,66.35) (71,66.85) (73,67.43) (74,68.00) (75,68.11) (76,68.68) (77,69.27) (78,69.50) (79,69.81) (80,70.28) (81,70.40) (82,70.94) (83,71.15) (85,71.60) (86,71.93) (87,71.98) (88,72.12) (89,72.06) (90,72.08) }; \addlegendentry{top-1} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (1,35.44) (2,55.47) (3,64.13) (4,68.26) (5,69.58) (6,72.41) (7,73.63) (8,74.74) (9,74.88) (10,75.76) (11,76.57) (12,76.57) (14,77.24) (15,77.50) (16,77.78) (17,77.42) (18,78.07) (20,78.76) (21,78.82) (22,78.49) (23,78.66) (24,79.17) (25,79.31) (26,79.23) (27,79.40) (28,79.53) (29,80.24) (31,79.79) (32,80.07) (33,80.14) (34,80.43) (35,81.07) (37,81.00) (38,81.18) (39,81.11) (40,81.63) (41,81.57) (43,81.84) (44,82.01) (45,82.25) (46,82.64) (47,82.93) (49,82.97) (50,83.02) (51,83.49) (52,83.27) (53,83.70) (55,84.19) (56,84.34) (57,84.45) (58,84.50) (59,85.24) (61,85.38) (62,85.47) (63,85.86) (64,85.98) (65,86.42) (67,86.78) (68,86.72) (69,87.29) (70,87.52) (71,87.68) (73,88.30) (74,88.32) (75,88.33) (76,88.74) (77,89.03) (78,89.28) (79,89.34) (80,89.60) (81,89.71) (82,89.86) (83,90.14) (85,90.34) (86,90.48) (87,90.61) (88,90.61) (89,90.66) (90,90.63) }; \addlegendentry{top-5} \end{axis} \end{tikzpicture}$

ResNet-34 / ImageNet-1k validation accuracy per epoch (bf16, 4$\times $ 4060 Ti). The kink near epoch 5 is the end of warmup; the steady late climb is the cosine anneal.

A reasonable stopping point

For the fastest path through this book, the route is roughly:

The intro chapters & Ch 1: framework setup (mandatory context).
Ch 2–3: MLP and CNN worked examples — the simplest cases of the framework reaching real architectures.
Skim Ch 4 (BatchNorm). It’s structurally the hardest chapter and the prose says so up front. The math is right (pdiv_bnNormalize’s three-term cancellation is formalized and matches the codegen), but it’s not the place to spend time if you’re not specifically interested in BN. Spend some of the time you’re saving on running as many MNIST and CIFAR demos as possible.
Read this chapter (Ch 5, ResNet-34). Residual blocks are where the chain rule shows it can carry composition through arbitrarily deep architectures; the proof reduces to one additive-fan-in lemma you’ve already seen. Run the ResNet-50 RSB-A3 ImageNet training demo locally and/or in the cloud ($\sim $6–9 hrs, $\sim $$15 in compute).
Stop here. Assume Chapters 6–9 (MobileNet, EfficientNet, ConvNeXt, ViT) and the Bestiary (Ch 10) are correct — they all reuse the same framework with new operators slotting into the same proof tree. Appendix C walks through how that claim is auditable if you ever want to spot-check it.

Reading through Ch 5 is enough to know the framework works. For a reinforcement-learning detour, read the AlphaGo Zero paper (bestiary entry), which combines reinforcement learning with a large residual network of $3\times 3$ convolutions.

Or, for breadth.

If you want to cover most of modern deep learning instead of stopping short, the fuller path is to keep reading through Ch 9 (ViT) for the full image-classification arc and then run the trainers in demos/:

lake exe yolov1-pets-train-bootstrap # detection (YOLO on Pets, cat/dog)
lake exe unet-pets-train         # segmentation (UNet on Pets)
lake exe cifar-ddpm-train        # diffusion generative models
lake exe gradcam                 # explainability
lake exe tinygpt-shakespeare train   # NLP (char-level transformer)

The arc from ViT to TinyGPT in particular reuses the attention proofs from Ch 9 verbatim — image-token attention and character-token attention are the same operator, just at different token alphabets. Image classification + detection + segmentation + diffusion + NLP, all on the same Lean$\to $MLIR$\to $IREE pipeline, covers most of what people mean by “modern deep learning.”

Side quest: ResNet-50, RSB-A3 (100 epochs)

The basic block ResNet-34 is built from generalizes to the bottleneck block of ResNet-50, and the jax/ folder ships it as a full ImageNet demo (jax/MainResnet50Imagenet.lean) trained on the modern “ResNet Strikes Back” A3 recipe (Wightman et al. 2021): LAMB at effective batch 2048 (via gradient accumulation), BCE over Mixup/CutMix soft targets, RandAugment, train@160 / eval@224. On the same proven-backward bottleneck codegen it reached $\mathbf{77.22\% }$ top-1 / $\mathbf{93.34\% }$ top-5 at 100 epochs — within $0.9$ points of the paper’s $78.1\% $, reproducing (and slightly beating) an earlier $76.66\% $ / $93.03\% $ run to within run-to-run noise.

Compute + the BN-sharding axis. The run trained on the same four usable NVIDIA RTX 4060 Ti (CUDA, bf16 matmul and convolution) as the ResNet-34 bf16 run, at effective batch 2048. Four 16 GB cards cannot hold that batch on one device, so it is split $4\times 512$ across the GPUs and accumulated $4\times $ to reach 2048 — which means BatchNorm normalizes over each $512$-image micro-batch (“Ghost-BN”), not the full $2048$. A single $\sim $80 GB A100 holds the whole batch, so the true-2048 recipe (no gradient accumulation) normalizes BatchNorm over all $2048$ images at once. Same network, same effective batch — the only difference is the BN statistic group, which is the axis this comparison isolates:

GPU	BN regime	Per epoch	Epochs	Total	Val top-1	Val top-5
4$\times $ 4060 Ti (CUDA, bf16)	Ghost-BN (512)	$\sim $9.1 min	100	$\sim $17 hr	$\mathbf{77.22\% }$	$\mathbf{93.34\% }$
1$\times $ A100 (bf16, true-2048)	full-batch (2048)	$\sim $6.3 min	100	$\sim $10.5 hr	$\mathbf{77.00\% }$	$\mathbf{93.15\% }$
1$\times $ MI300X (bf16, true-2048)	full-batch (2048)	—	100	$\sim $7.2 hr^†	—	—

ResNet-50 / RSB-A3, 100 epochs: the 4-GPU run splits the batch and normalizes BatchNorm per $512$-image shard; the A100 true-2048 row normalizes over the full $2048$. The $\sim $17 hr wall-clock includes three 30-minute thermal duty-cycle cooldowns (the 4060 Ti box is duty-cycled on long runs); the run itself logged zero PCIe/AER interruptions across all 100 epochs. ^† MI300X row is a projected single-device rerun ($\sim $7.2 hr, no gradient accumulation); its per-epoch time and accuracies are pending the run.

What the A100 row tests. Removing the Ghost-BN approximation barely moves the needle: true-2048 lands at $77.00\% $ against Ghost-BN’s $77.22\% $ — a $\sim $0.2-point difference, and slightly lower, well inside run-to-run noise. Normalizing BatchNorm over the full $2048$ neither helps nor hurts, so the BN-sharding regime is not where the residual $\sim $1 point to the paper’s $78.1\% $ hides.

Where it does hide: the resize. That gap turns out to be an evaluation artifact, not a training one. Both rows here are scored through this repo’s tf.image.resize preprocessing, whereas the timm reference reports $78.1\% $ through PIL’s bicubic resize — and the two resamplers are not interchangeable. Port these weights into a standard PIL-resize eval and they fall to $\sim $74.4%; port the reference a3_in1k weights into this pipeline and they fall to $\sim $74%. The ranking flips with the resize library: each network scores best on the resampler it trained with, because it partly fits that resampler’s specific aliasing. So the honest reading is that the two recipes sit within a couple of points, and which one “wins” is decided by the eval pipeline — not that this run is a point short.

[TODO — resize/eval reconciliation.] Put true-2048 and the timm reference on one ruler — a single fixed resampler, or a bit-exact tf.image.resize match in the reference eval — before reading anything into the $77.00\% $ vs $78.1\% $ gap.

Changed since the $76.66\% $ first run: RandAugment is now timm-literal (rand-m6-mstd0.5-inc1).

What A3 changes from the 2018 recipe

The ResNet-34 trainer above is a 2018-era recipe: the original paper’s SGD-with-momentum plus the “bag of tricks” polish (cosine schedule, warmup, label smoothing, random-resized-crop). RSB-A3 is the 2021 “ResNet Strikes Back” recipe (Wightman et al.). The architecture aside (basic block vs. bottleneck), every training knob that A3 sets differently is one lever of the three-year jump in how these networks are trained. The backbone code and the .train entry point are identical; only the TrainConfig differs. Here is the full diff:

Knob	ResNet-34 (2018)	RSB-A3 (2021)	What the A3 choice buys
Backbone	basic block, 21.8 M	bottleneck, 25.6 M	deeper net at similar cost; $1\times 1$–$3\times 3$–$1\times 1$ bottleneck
Optimizer	SGD + momentum $0.9$	LAMB	layer-wise adaptive LR, designed for very large batches
Effective batch	256	2048 ($512\times 4$ accum)	the large-batch regime LAMB targets; grad-accum fits it on 16 GB
Peak LR	$0.1$	$0.008$ @ bs2048	LAMB’s trust-ratio scale — not comparable to SGD’s $0.1$
Epochs	90	100	essentially the same budget (A3 is the cheap RSB tier)
Loss	softmax cross-entropy	BCE-with-logits, multi-hot	treats classes independently; matches mixed soft targets
Label smoothing	$0.1$	$0.0$	subsumed — BCE over soft mixup labels already softens targets
Mixup / CutMix	none	$\alpha \, 0.1$ / $\alpha \, 1.0$	interpolate/paste samples $\to $ soft multi-hot targets, strong regularizer
RandAugment	none	m6 N2 mstd0.5 inc1	automated heavy color+geometric augmentation policy
Weight decay	$10^{-4}$, all params	$0.02$, skip BN $\gamma /\beta $ + bias	$200\times $ stronger, but never decays scale/shift/bias params
Train / eval res	$224$ / $224$	$160$ / $224$	FixRes: train at cheap low res ($\sim $2$\times $ faster/step), test at high res — the train/test gap helps
BN statistics	global over batch	Ghost-BN over 512	side-effect of grad-accum (each micro-step normalizes its own 512); the A100 true-2048 row removes it

What carries over unchanged: cosine LR decay, the 5-epoch warmup, random-resized-crop + horizontal flip as the base augmentation, bf16 matmul and convolution, and running-BN evaluation. The through-line of the 2018$\to $2021 shift is regularize harder, train bigger: heavier augmentation (RandAugment + Mixup + CutMix), a loss that accepts soft targets (BCE), an order-of-magnitude more weight decay applied selectively, and a large-batch optimizer (LAMB) to make the longer, noisier training converge — with FixRes buying back the compute the extra epochs would cost. Every one of these is a one-line TrainConfig change on the same verified backbone.

Run	Val accuracy	\(\Delta \) vs full
Full recipe	90.29%	—
Bare recipe (plain SGD + momentum, no other tricks)	72.82%	\(-\)17.47
Full minus basic augmentation	82.71%	\(-\)7.58
Full minus cosine decay	86.81%	\(-\)3.48
Full minus Adam (vanilla SGD, lr 0.01)	87.04%	\(-\)3.25
Full minus warmup	88.88%	\(-\)1.41
Full minus label smoothing	89.24%	\(-\)1.05
Full minus weight decay	89.68%	\(-\)0.61