9 ConvNeXt

ConvNets strike back

By 2022 the conventional read on image recognition had flipped. ResNet had been the default backbone for six years. Then in late 2020 ViT (Dosovitskiy et al., arXiv:2010.11929) showed that a pure transformer could match or beat ResNet on ImageNet at sufficient scale, and a year later Swin (Liu et al. 2021, arXiv:2103.14030) layered in hierarchical attention to win COCO/ADE20K on top of that. The research community’s working assumption became “transformers beat convnets for vision now.”

ConvNeXt (Liu et al. 2022, arXiv:2201.03545) was a deliberate stress test of that assumption. The authors started from a standard ResNet-50 and asked: which of the design differences between ResNet and Swin is actually doing the work? They proceeded by applying transformer-era choices to the ResNet one at a time and measuring each contribution in isolation. The full list:

Training recipe modernization (300 epochs, AdamW, RandAug, Mixup, CutMix, label smoothing, stochastic depth, layer scale)
Stage compute ratio change (from (3, 4, 6, 3) to (3, 3, 9, 3), matching Swin)
“Patchify” stem (replace the 7\(\times \)7 stride-2 conv with a 4\(\times \)4 stride-4 conv, matching ViT’s patch embedding)
Depthwise convolutions (split spatial and channel mixing, as in MobileNetV2)
Inverted bottleneck (1\(\times \)1 expand 4\(\times \), depthwise, 1\(\times \)1 project, MobileNetV2-style)
Larger kernel (depthwise 7\(\times \)7 instead of 3\(\times \)3, to approximate Swin’s window attention receptive field)
Fewer activations and norms (one per block instead of one per layer, again matching transformer blocks)
LayerNorm instead of BatchNorm
GELU instead of ReLU

The result: a model that’s pure convolution top to bottom, doesn’t contain a single attention operation, and matches or beats Swin-T on ImageNet at the same compute budget. The conclusion of the paper isn’t “convolutions are back”—it’s “the architectural gap between ResNet and Swin was mostly the training recipe, not the receptive field.” Most of what looked like transformer superiority was modernization that hadn’t been back-ported to the convnet baseline.

Two new primitives

ConvNeXt requires two activations / normalizations that the chapters so far haven’t needed:

LayerNorm (Ba, Kiros, Hinton 2016, arXiv:1607.06450). BatchNorm normalizes per-feature across a batch: it relies on the batch having enough samples for the mean and variance to be stable. That fails for small batches (memory-constrained training) and for variable-length sequences (where “the batch axis” isn’t a clean concept). LayerNorm solves both by flipping the axis: normalize per-sample across the feature dimension instead. Same three-term Jacobian cancellation as BN (Chapter 5), same closed-form backward; just a different reduction axis. Theorem 44 states this formally.

GELU (Hendrycks & Gimpel 2016, arXiv:1606.08415). A smooth approximation of ReLU: \(\mathrm{GELU}(x) = x \cdot \Phi (x)\), where \(\Phi \) is the standard-normal CDF. ReLU has a kink at zero, which forces the codegen to substitute a subgradient convention (Chapter 3). GELU is differentiable everywhere, with a soft-gate behavior near zero that empirically helps transformers (and, it turns out, modernized CNNs). Diagonal activation Jacobian, same family as ReLU’s diagonal, just with a smooth scalar derivative instead of an indicator. Theorems 41–45 provide the scalar function, its derivative, the Jacobian, and the assembled VJP.

Both primitives are used by the upcoming Vision Transformer chapter (Chapter 10) without modification. ConvNeXt introduces them here so they’re available before ViT needs them.

The ConvNeXt block

Each ConvNeXt block stacks the modernization items into a single residual unit:

\[ \text{DW } 7 \times 7 \; \to \; \text{LN} \; \to \; 1 \times 1\ \text{expand } 4\times \; \to \; \text{GELU} \; \to \; 1 \times 1\ \text{project} \; \to \; \text{LayerScale} \; \to \; +\, \text{residual}. \]

Two pieces are worth naming. The inverted bottleneck (\(1 \times 1\) expand 4\(\times \), depthwise spatial conv at the wide point, \(1 \times 1\) project) is exactly the MobileNetV2 inverted residual from Chapter 7—ConvNeXt is paying the MobileNet idea forward. LayerScale (Touvron et al. 2021) is a learnable per-channel scalar \(\gamma \) initialized to \(10^{-6}\) that multiplies the block’s contribution before the residual add, keeping the block near-identity at init in the same spirit as ResNet’s residual. The spec’s .convNextStage primitive emits \(N\) of these blocks at fixed channel width; the .convNextDownsample primitive emits the LN + \(2 \times 2\) stride-2 conv that transitions between stages.

ConvNeXt-T is four stages, ResNet-style

The full ConvNeXt-T spec is the same stem–stages–head template as ResNet-34: a \(4 \times 4\) stride-4 “patchify” stem replaces the \(7 \times 7\) stride-2 stem; four stages with block counts \((3, 3, 9, 3)\) at channel widths \((96, 192, 384, 768)\) replace the ResNet stages; a final global-average-pool plus 768-to-10 dense replaces the head. Total: 27.8M parameters, comparable to ResNet-50. Every architectural primitive in the spec is now codegen-backed and has a proved VJP; none of the math in this chapter is genuinely new beyond the LayerNorm-axis rotation and the GELU smoothness.

The bigger pedagogical point—and the one the example section below makes concrete—is that with a modernized convnet backbone, the dominant axis of variation between runs becomes the training recipe, not the architecture. The chapter finishes with a CutMix-vs-bare-vs-RandAugment ablation that lifts ConvNeXt-T’s Imagenette accuracy by 2.9 points without touching a single weight.

The theorems

Definition 41 GELU scalar function

✓

Concrete \(0.5x(1 + \mathrm{erf}(x/\sqrt2))\).

Definition 42 GELU scalar derivative

✓

Closed form for \(\mathrm{geluScalar}'\).

Theorem 43 GELU Jacobian

✓

Diagonal activation Jacobian. Proved via fderiv_apply + chain rule with \(\mathrm{geluScalar} \circ \mathrm{ContinuousLinearMap.proj}\, j\), then fderiv_eq_smul_deriv to convert scalar \(\operatorname {fderiv}\) to \(\mathrm{deriv}\).

Proof ▶

Mechanical; see Proofs.pdiv_gelu.

Theorem 44 LayerNorm VJP

✓

LayerNorm = BatchNorm on a different axis; same primitive.

Proof ▶

Mechanical; see Proofs.layerNorm_has_vjp.

Theorem 45 GELU VJP

✓

Proof ▶

Mechanical; see Proofs.gelu_has_vjp.

9.1 Example: ConvNeXt-T on Imagenette

ConvNeXt (Liu et al. 2022, arXiv:2201.03545) was the “ConvNets strike back” paper: a deliberate back-port of design choices from the Swin Transformer back into a pure CNN, asking which of those choices were doing the actual work. The answer: depthwise \(7 \times 7\) convs at lower channel counts, LayerNorm instead of BatchNorm, GELU instead of ReLU, and an inverted- bottleneck-style \(4\times \) expansion in every block. None of those ingredients are transformer-specific. The architecture below is pure convolution top to bottom; the modernization is in the recipe, not the receptive field.

ConvNeXt-T is the smallest variant in the paper, \(\sim \)28M params (roughly the parameter budget of ResNet-50). At ImageNet-1K scale the paper reports \(82.1\% \) top-1, beating Swin-T at the same compute budget. At our 9.5K-image Imagenette scale, the data regime is different but the architectural story still holds.

The architecture

-- 1
import LeanMlir

-- 2
def convNextTiny : NetSpec where
  name   := "ConvNeXt-T-GELU"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 96 4 4 .same,                          -- patchify stem 224→56
    .convNextStage 96 3 .ln .gelu,                   -- 3 blocks @ 96 ch
    .convNextDownsample 96 192,                      -- LN + 2×2 stride 2 → 28
    .convNextStage 192 3 .ln .gelu,                  -- 3 blocks @ 192 ch
    .convNextDownsample 192 384,                     -- → 14
    .convNextStage 384 9 .ln .gelu,                  -- 9 blocks @ 384 ch
    .convNextDownsample 384 768,                     -- → 7
    .convNextStage 768 3 .ln .gelu,                  -- 3 blocks @ 768 ch
    .globalAvgPool,
    .dense 768 10 .identity
  ]

-- 3
def convNextTinyConfig : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  convNextTiny.train convNextTinyConfig
    (args.head?.getD "data/imagenette")

The structure follows the paper’s macro-design prescription: 4 stages with block counts \((3, 3, 9, 3)\) and channels doubling on every downsample \((96 \to 192 \to 384 \to 768)\), with the heavy 9-block stage at the \(14 \times 14\) / 384-channel resolution. .convNextStage is \(N\) ConvNeXt blocks (DW \(7\times 7\) + LN over channels + \(1\times 1\) expand \(4\times \) + GELU + \(1\times 1\) project + LayerScale + residual). The stem is a \(4 \times 4\) stride-4 conv (mirrored from ViT’s patch-embed), not the \(7 \times 7\) stride-2 stem of a ResNet — one of the seven modernization-recipe items called out in the paper. The same base recipe as the EnetB0 chapter (Adam @ 0.001, cosine, warmup-3, WD 1e-4, label smoothing 0.1) keeps the cross-architecture comparison clean.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/ablation convnext-tiny-gelu
ConvNeXt-T-GELU: 27826186 params
Generating train step MLIR...
  790422 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images
  27826186 params + m + v (318 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine warmup=3, label_smooth=0.1, wd=1e-4
  BN layers: 1, BN stat floats: 192
  step 0/295: loss=2.453635 (2030ms)
Epoch 10/80: loss=0.959386 lr=0.000985 (596s/epoch)
Epoch 20/80: loss=0.581696 lr=0.000897
Epoch 40/80: loss=0.522144 lr=0.000551
Epoch 60/80: loss=0.504293 lr=0.000173
Epoch 80/80: loss=0.501892 lr=0.000000
Saved params + BN stats.

$ LEAN_MLIR_EVAL_ONLY=1 \
    ./.lake/build/bin/ablation convnext-tiny-gelu
EVAL ONLY  ConvNeXt-T-GELU: 3316/3904 = 84.94%

Final val accuracy 84.94% on Imagenette, wall time \(\sim \)13.3 hours, train loss plateaus at the same \(\sim \)0.50 label-smoothing floor as ResNet-34 / MobileNet V2 / EfficientNet-B0. Extending the comparison table from the EfficientNet chapter:

Model	Params	MLIR	Step time	Total	Val acc
ResNet-34	21.29M	518 KB	1400 ms	9.5 h	90.29%
MobileNet V2	2.24M	741 KB	830 ms	5.4 h	87.09%
EfficientNet-B0	7.16M	938 KB	940 ms	6.2 h	87.58%
ConvNeXt-T	27.83M	790 KB	2030 ms	13.3 h	84.94%

ConvNeXt-T is the slowest per step (2030 ms vs EnetB0’s 940 ms) because depthwise \(7 \times 7\) convs at the early stages plus LayerNorm-over-channels touch every spatial position twice per block. The MLIR is mid-pack at 790 KB — smaller than EnetB0 because there’s no SE machinery, larger than ResNet-34 because LayerScale and per-stage downsamples bulk up the per-block emit.
84.94% is below EnetB0’s 87.58% on the same base recipe. Two things contribute: (a) ConvNeXt-T at 28M params has 4\(\times \) EnetB0’s parameter count and our 9.5K training images don’t have the data scale to fill that capacity without aug; (b) the ConvNeXt paper trains with the full DeiT recipe (Mixup, CutMix, RandAugment, Random Erasing) where the modernization actually lives. The next section layers those in — with proper aug, ConvNeXt-T’s accuracy at this scale catches up to or passes EnetB0.
The architecture is fine, the recipe is the lever. The 84.94% number we land here vs the paper’s 82.1% top-1 on ImageNet-1K is the same model getting different scores on different datasets, not a code-correctness issue.

9.2 Data Augmentation

ConvNeXt’s headline accuracy depends on a heavy augmentation pack inherited from DeiT: Mixup (Zhang et al. 2017, arXiv:1710.09412), CutMix (Yun et al. 2019, arXiv:1905.04899), Random Erasing (Zhong et al. 2017, arXiv:1708.04896), and RandAugment (Cubuk et al. 2019, arXiv:1909.13719). All four are implemented as data-only kernels that mutate the input tensor before the graph sees it — tier 1 in our codegen scope, no MLIR plumbing required. Mixup and CutMix route through a soft-label train-step variant since they produce fractional labels; Random Erasing and RandAugment leave labels alone.

Holding architecture and base optimizer fixed, we can measure the marginal effect of each augmentation knob layered on top of the bare config, on the same Imagenette data and 80-epoch budget.

Cell	Val acc	\(\Delta \) vs bare
convnext-tiny-gelu-cutmix	87.81%	\(+2.9\)
convnext-tiny-gelu-erase	85.63%	\(+0.7\)
convnext-tiny-gelu-randaug (M=9)	85.48%	\(+0.5\)
convnext-tiny-gelu (bare)	84.94%	—
convnext-tiny-gelu-mixup	83.45%	\(-1.5\)

CutMix is the load-bearing knob. \(+2.9\% \) over bare, a single config change. The ViT-Tiny ablation in Ch 10 reaches the same conclusion at a different architecture: CutMix alone is the biggest single lift.
Random Erasing and RandAugment at M=9 are in the same tier. \(+0.7\% \) and \(+0.5\% \) respectively; both at the edge of seed noise. M=9 is too aggressive for our 9.5K-image scale (the paper trained on 1.2M images), and erasing 25% of pixels is in the same noise band.
Mixup actively hurts at this scale. \(-1.5\% \) below bare. The blended-label gradient signal is too aggressive at \(\sim \)475 images per class; the model can’t extract a clean target from a Beta(0.8, 0.8) mix of two images when each class has so few exemplars. A Mixup ablation on full ImageNet (1.28M images, \(\sim \)1280 per class) typically lifts \(+0.5\) to \(+1.0\); here we’re 100\(\times \) below that data scale.

One seed per cell. The \(+0.5\% \) and \(+0.7\% \) deltas on RandAugment and Random Erasing are within noise; the \(+2.9\% \) delta from CutMix and the \(-1.5\% \) delta from Mixup are well above. The cross- architecture confirmation with Ch 10’s ViT-Tiny ablation (CutMix \(\gg \) RandAug at the same data scale) suggests the ranking is data-regime-driven, not architecture-driven.