Verified Deep Learning with Lean 4
Formal Backpropagation from MLP to Attention, via MLIR

9 ConvNeXt

ConvNets strike back

By 2022 the conventional read on image recognition had flipped. ResNet had been the default backbone for six years. Then in late 2020 ViT (Dosovitskiy et al., arXiv:2010.11929) showed that a pure transformer could match or beat ResNet on ImageNet at sufficient scale, and a year later Swin (Liu et al. 2021, arXiv:2103.14030) layered in hierarchical attention to win COCO/ADE20K on top of that. The research community’s working assumption became “transformers beat convnets for vision now.”

ConvNeXt (Liu et al. 2022, arXiv:2201.03545) was a deliberate stress test of that assumption. The authors started from a standard ResNet-50 and asked: which of the design differences between ResNet and Swin is actually doing the work? They proceeded by applying transformer-era choices to the ResNet one at a time and measuring each contribution in isolation. The full list:

  1. Training recipe modernization (300 epochs, AdamW, RandAug, Mixup, CutMix, label smoothing, stochastic depth, layer scale)

  2. Stage compute ratio change (from (3, 4, 6, 3) to (3, 3, 9, 3), matching Swin)

  3. “Patchify” stem (replace the 7\(\times \)7 stride-2 conv with a 4\(\times \)4 stride-4 conv, matching ViT’s patch embedding)

  4. Depthwise convolutions (split spatial and channel mixing, as in MobileNetV2)

  5. Inverted bottleneck (1\(\times \)1 expand 4\(\times \), depthwise, 1\(\times \)1 project, MobileNetV2-style)

  6. Larger kernel (depthwise 7\(\times \)7 instead of 3\(\times \)3, to approximate Swin’s window attention receptive field)

  7. Fewer activations and norms (one per block instead of one per layer, again matching transformer blocks)

  8. LayerNorm instead of BatchNorm

  9. GELU instead of ReLU

The result: a model that’s pure convolution top to bottom, doesn’t contain a single attention operation, and matches or beats Swin-T on ImageNet at the same compute budget. The conclusion of the paper isn’t “convolutions are back”—it’s “the architectural gap between ResNet and Swin was mostly the training recipe, not the receptive field.” Most of what looked like transformer superiority was modernization that hadn’t been back-ported to the convnet baseline.

Two new primitives

ConvNeXt requires two primitives that the chapters so far haven’t needed, one normalization and one activation:

LayerNorm (Ba, Kiros, Hinton 2016, arXiv:1607.06450). BatchNorm normalizes per-feature across a batch: it relies on the batch having enough samples for the mean and variance to be stable. That fails for small batches (memory-constrained training) and for variable-length sequences (where “the batch axis” isn’t a clean concept). LayerNorm solves both by flipping the axis: normalize per-sample across the feature dimension instead. Same three-term Jacobian cancellation as BN (Chapter 5), same closed-form backward; just a different reduction axis. Theorem 44 states this formally.
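To make the axis flip concrete, here is a minimal pure-Python sketch of LayerNorm on a single sample's feature vector, with the three-term closed-form backward checked against central finite differences. The names `layer_norm`, `layer_norm_vjp`, and `numeric_vjp`, the `eps` default, and the test vectors are illustrative assumptions; this is not the book's Lean definition or its emitted MLIR.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one sample across its features, then scale and shift."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    xhat = [(v - mu) * inv_std for v in x]
    y = [g * h + b for g, h, b in zip(gamma, xhat, beta)]
    return y, xhat, inv_std

def layer_norm_vjp(dy, gamma, xhat, inv_std):
    """Closed-form input gradient: the same three terms as BatchNorm."""
    n = len(dy)
    dxhat = [g * d for g, d in zip(gamma, dy)]
    mean_d = sum(dxhat) / n                                  # term 2: mean of dxhat
    mean_dx = sum(d * h for d, h in zip(dxhat, xhat)) / n    # term 3: mean of dxhat * xhat
    return [inv_std * (d - mean_d - h * mean_dx) for d, h in zip(dxhat, xhat)]

def numeric_vjp(x, gamma, beta, dy, h=1e-6):
    """Central finite differences of <dy, layer_norm(x)> with respect to x."""
    grads = []
    for j in range(len(x)):
        xp = list(x); xp[j] += h
        xm = list(x); xm[j] -= h
        yp, _, _ = layer_norm(xp, gamma, beta)
        ym, _, _ = layer_norm(xm, gamma, beta)
        grads.append(sum(d * (a - b) for d, a, b in zip(dy, yp, ym)) / (2 * h))
    return grads

x = [0.5, -1.2, 3.0, 0.1]
gamma = [1.0, 0.5, 2.0, 1.5]
beta = [0.0, 0.1, -0.2, 0.3]
dy = [0.3, -0.7, 1.1, 0.2]

y, xhat, inv_std = layer_norm(x, gamma, beta)
analytic = layer_norm_vjp(dy, gamma, xhat, inv_std)
numeric = numeric_vjp(x, gamma, beta, dy)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print(max_err)
```

The reduction runs over the features of one sample, so nothing here depends on a batch. Running the same arithmetic down the batch axis instead would give the BatchNorm case.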

GELU (Hendrycks & Gimpel 2016, arXiv:1606.08415). A smooth approximation of ReLU: \(\mathrm{GELU}(x) = x \cdot \Phi (x)\), where \(\Phi \) is the standard-normal CDF. ReLU has a kink at zero, which forces the codegen to substitute a subgradient convention (Chapter 3). GELU is differentiable everywhere, with a soft-gate behavior near zero that empirically helps transformers (and, it turns out, modernized CNNs). Its activation Jacobian is diagonal, in the same family as ReLU’s, but with a smooth scalar derivative instead of an indicator. Definitions 41–42 and Theorems 43 and 45 provide the scalar function, its derivative, the Jacobian, and the assembled VJP.
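As a numeric companion to the GELU description, the sketch below writes out the scalar function in its erf form, the derivative \(\Phi (x) + x\,\varphi (x)\) that follows from the product rule (\(\varphi \) is the standard-normal density), and the diagonal VJP. The Python names `gelu`, `gelu_deriv`, and `gelu_vjp` are illustrative assumptions and do not mirror the Lean identifiers.

```python
import math

def gelu(x):
    """GELU(x) = x * Phi(x), written with erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_deriv(x):
    """d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def gelu_vjp(xs, dys):
    """Diagonal Jacobian: each cotangent is scaled by the scalar derivative."""
    return [gelu_deriv(x) * dy for x, dy in zip(xs, dys)]

# Check the closed-form derivative against central finite differences.
h = 1e-6
points = [-3.0, -0.5, 0.0, 0.5, 3.0]
errs = [abs((gelu(x + h) - gelu(x - h)) / (2 * h) - gelu_deriv(x)) for x in points]
print(max(errs))
print(gelu_deriv(0.0))   # slope at the origin, where ReLU has its kink
```

At the origin the derivative is \(\Phi (0) = 0.5\), a well-defined slope exactly where ReLU needs a subgradient convention.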

Both primitives are used by the upcoming Vision Transformer chapter (Chapter 10) without modification. ConvNeXt introduces them here so they’re available before ViT needs them.

The ConvNeXt block

Each ConvNeXt block stacks the modernization items into a single residual unit:

\[ \text{DW } 7 \times 7 \; \to \; \text{LN} \; \to \; 1 \times 1\ \text{expand } 4\times \; \to \; \text{GELU} \; \to \; 1 \times 1\ \text{project} \; \to \; \text{LayerScale} \; \to \; +\, \text{residual}. \]

Two pieces are worth naming. The inverted bottleneck (\(1 \times 1\) expand 4\(\times \), \(1 \times 1\) project) comes from the MobileNetV2 inverted residual of Chapter 7, with one change visible in the diagram above: the depthwise spatial conv moves ahead of the expansion, so it runs at the narrow width rather than at the wide point. LayerScale (Touvron et al. 2021) is a learnable per-channel scalar \(\gamma \) initialized to \(10^{-6}\) that multiplies the block’s contribution before the residual add, keeping the block near-identity at init in the same spirit as ResNet’s residual. The spec’s .convNextStage primitive emits \(N\) of these blocks at fixed channel width; the .convNextDownsample primitive emits the LN + \(2 \times 2\) stride-2 conv that transitions between stages.

ConvNeXt-T is four stages, ResNet-style

The full ConvNeXt-T spec is the same stem–stages–head template as ResNet-34: a \(4 \times 4\) stride-4 “patchify” stem replaces the \(7 \times 7\) stride-2 stem; four stages with block counts \((3, 3, 9, 3)\) at channel widths \((96, 192, 384, 768)\) replace the ResNet stages; a final global-average-pool plus 768-to-10 dense replaces the head. Total: 27.8M parameters, comparable to ResNet-50. Every architectural primitive in the spec is now codegen-backed and has a proved VJP; none of the math in this chapter is genuinely new beyond the LayerNorm-axis rotation and the GELU smoothness.
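The 27.8M figure can be sanity-checked with a few lines of arithmetic. The sketch below counts parameters stage by stage under stated assumptions: every conv and dense layer in the blocks, downsamples, and head carries a bias; the stem is a bias-free \(4 \times 4\) conv plus BatchNorm scale and shift; and no norm sits between the global average pool and the dense head. The per-layer breakdown is our guess at how the count is distributed, chosen because it reproduces the total printed in the Imagenette training log of Section 9.1. The real emitter may split parameters differently.

```python
def block_params(c):
    """One ConvNeXt block at channel width c."""
    dw = 7 * 7 * c + c             # depthwise 7x7 weights + bias
    ln = 2 * c                     # LayerNorm scale + shift
    expand = c * 4 * c + 4 * c     # 1x1 expand to 4c, with bias
    project = 4 * c * c + c        # 1x1 project back to c, with bias
    layer_scale = c                # per-channel gamma
    return dw + ln + expand + project + layer_scale

def downsample_params(c_in, c_out):
    """LayerNorm over c_in, then a 2x2 stride-2 conv with bias."""
    return 2 * c_in + 2 * 2 * c_in * c_out + c_out

widths = [96, 192, 384, 768]
depths = [3, 3, 9, 3]
num_classes = 10

stem = 4 * 4 * 3 * widths[0] + 2 * widths[0]    # assumed: bias-free conv + BN scale/shift
stages = sum(n * block_params(c) for n, c in zip(depths, widths))
downs = sum(downsample_params(a, b) for a, b in zip(widths, widths[1:]))
head = widths[-1] * num_classes + num_classes   # dense 768 -> 10 with bias

total = stem + stages + downs + head
print(total)
```

Under these assumptions the sum is 27,826,186, the count the training log reports. Each block costs \(8c^2 + 58c\) parameters, so the nine 384-channel blocks and three 768-channel blocks account for nearly all of it.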

The bigger pedagogical point, which the example sections below make concrete, is that with a modernized convnet backbone, the dominant axis of variation between runs becomes the training recipe, not the architecture. Section 9.3 runs a bare-vs-CutMix-vs-RandAugment ablation in which CutMix alone lifts ConvNeXt-T’s Imagenette accuracy by 2.9 points without changing the architecture.

The theorems

Definition 41 GELU scalar function

Concrete \(0.5x(1 + \mathrm{erf}(x/\sqrt2))\).

Definition 42 GELU scalar derivative

Closed form for \(\mathrm{geluScalar}'\).

Theorem 43 GELU Jacobian

Diagonal activation Jacobian. Proved via fderiv_apply + chain rule with \(\mathrm{geluScalar} \circ \mathrm{ContinuousLinearMap.proj}\, j\), then fderiv_eq_smul_deriv to convert scalar \(\operatorname {fderiv}\) to \(\mathrm{deriv}\).

Proof

Mechanical; see Proofs.pdiv_gelu.

Theorem 44 LayerNorm VJP

LayerNorm = BatchNorm on a different axis; same primitive.

Proof

Mechanical; see Proofs.layerNorm_has_vjp.

Theorem 45 GELU VJP
Proof

Mechanical; see Proofs.gelu_has_vjp.

9.1 Example: ConvNeXt-T on Imagenette

ConvNeXt (Liu et al. 2022, arXiv:2201.03545) was the “ConvNets strike back” paper: a deliberate back-port of design choices from the Swin Transformer into a pure CNN, asking which of those choices were doing the actual work. The answer: depthwise \(7 \times 7\) convs at lower channel counts, LayerNorm instead of BatchNorm, GELU instead of ReLU, and an inverted-bottleneck-style \(4\times \) expansion in every block. None of those ingredients is transformer-specific. The architecture below is pure convolution top to bottom; the modernization is in the recipe, not the receptive field.

ConvNeXt-T is the smallest variant in the paper, \(\sim \)28M params (roughly the parameter budget of ResNet-50). At ImageNet-1K scale the paper reports \(82.1\% \) top-1, beating Swin-T at the same compute budget. At our 9.5K-image Imagenette scale, the data regime is different but the architectural story still holds.

The architecture

-- Library import.
import LeanMlir

-- Network spec: ConvNeXt-T for 10-class Imagenette at 224×224.
def convNextTiny : NetSpec where
  name   := "ConvNeXt-T-GELU"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 96 4 4 .same,                          -- patchify stem 224→56
    .convNextStage 96 3 .ln .gelu,                   -- 3 blocks @ 96 ch
    .convNextDownsample 96 192,                      -- LN + 2×2 stride 2 → 28
    .convNextStage 192 3 .ln .gelu,                  -- 3 blocks @ 192 ch
    .convNextDownsample 192 384,                     -- → 14
    .convNextStage 384 9 .ln .gelu,                  -- 9 blocks @ 384 ch
    .convNextDownsample 384 768,                     -- → 7
    .convNextStage 768 3 .ln .gelu,                  -- 3 blocks @ 768 ch
    .globalAvgPool,
    .dense 768 10 .identity
  ]

-- Training configuration: the base recipe, without Mixup or CutMix.
def convNextTinyConfig : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- Entry point: the first argument is the dataset directory (default data/imagenette).
def main (args : List String) : IO Unit :=
  convNextTiny.train convNextTinyConfig
    (args.head?.getD "data/imagenette")

The structure follows the paper’s macro-design prescription: 4 stages with block counts \((3, 3, 9, 3)\) and channels doubling on every downsample \((96 \to 192 \to 384 \to 768)\), with the heavy 9-block stage at the \(14 \times 14\) / 384-channel resolution. .convNextStage is \(N\) ConvNeXt blocks (DW \(7\times 7\) + LN over channels + \(1\times 1\) expand \(4\times \) + GELU + \(1\times 1\) project + LayerScale + residual). The stem is a \(4 \times 4\) stride-4 conv (mirrored from ViT’s patch-embed), not the \(7 \times 7\) stride-2 stem of a ResNet; this is item 3 in the modernization list at the start of the chapter. The same base recipe as the EfficientNet-B0 chapter (Adam @ 0.001, cosine, warmup-3, WD 1e-4, label smoothing 0.1) keeps the cross-architecture comparison clean.
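The cosine-with-warmup schedule in this recipe is easy to sketch. The function below assumes a linear warmup over the first three epochs (the warmup shape is our assumption; the trainer's implementation is not shown in this chapter) followed by a half-cosine over the remaining 77. With epochs counted from 1, the cosine part matches the lr values printed in the training log of the Results section (0.000985 at epoch 10 down to 0.000000 at epoch 80). The name `lr_at_epoch` is hypothetical.

```python
import math

def lr_at_epoch(epoch, base_lr=0.001, total_epochs=80, warmup_epochs=3):
    """Learning rate for a 1-indexed epoch: linear warmup, then cosine decay."""
    e = epoch - 1                                   # 0-indexed epoch
    if e < warmup_epochs:
        return base_lr * (e + 1) / warmup_epochs    # assumed warmup shape
    progress = (e - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for epoch in (10, 20, 40, 60, 80):
    print(f"Epoch {epoch}/80: lr={lr_at_epoch(epoch):.6f}")
```

The learning rate is still above half its peak at epoch 40 and only collapses in the last quarter of the run, which lines up with the training loss flattening out after epoch 60.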

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/ablation convnext-tiny-gelu
ConvNeXt-T-GELU: 27826186 params
Generating train step MLIR...
  790422 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images
  27826186 params + m + v (318 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine warmup=3, label_smooth=0.1, wd=1e-4
  BN layers: 1, BN stat floats: 192
  step 0/295: loss=2.453635 (2030ms)
Epoch 10/80: loss=0.959386 lr=0.000985 (596s/epoch)
Epoch 20/80: loss=0.581696 lr=0.000897
Epoch 40/80: loss=0.522144 lr=0.000551
Epoch 60/80: loss=0.504293 lr=0.000173
Epoch 80/80: loss=0.501892 lr=0.000000
Saved params + BN stats.

$ LEAN_MLIR_EVAL_ONLY=1 \
    ./.lake/build/bin/ablation convnext-tiny-gelu
EVAL ONLY  ConvNeXt-T-GELU: 3316/3904 = 84.94%

Final val accuracy 84.94% on Imagenette, wall time \(\sim \)13.3 hours, train loss plateaus at the same \(\sim \)0.50 label-smoothing floor as ResNet-34 / MobileNet V2 / EfficientNet-B0. Extending the comparison table from the EfficientNet chapter:

Model             Params    MLIR size   Step time   Total time   Val acc
ResNet-34         21.29M    518 KB      1400 ms     9.5 h        90.29%
MobileNet V2      2.24M     741 KB      830 ms      5.4 h        87.09%
EfficientNet-B0   7.16M     938 KB      940 ms      6.2 h        87.58%
ConvNeXt-T        27.83M    790 KB      2030 ms     13.3 h       84.94%

  • ConvNeXt-T is the slowest per step (2030 ms vs EnetB0’s 940 ms) because depthwise \(7 \times 7\) convs at the early stages plus LayerNorm-over-channels touch every spatial position twice per block. The MLIR is mid-pack at 790 KB — smaller than EnetB0 because there’s no SE machinery, larger than ResNet-34 because LayerScale and per-stage downsamples bulk up the per-block emit.

  • 84.94% is below EnetB0’s 87.58% on the same base recipe. Two things contribute: (a) ConvNeXt-T at 28M params has 4\(\times \) EnetB0’s parameter count and our 9.5K training images don’t have the data scale to fill that capacity without aug; (b) the ConvNeXt paper trains with the full DeiT recipe (Mixup, CutMix, RandAugment, Random Erasing) where the modernization actually lives. Section 9.3 layers those in one at a time; with CutMix, ConvNeXt-T’s accuracy at this scale catches up to and slightly passes EnetB0.

  • The architecture is fine, the recipe is the lever. The 84.94% number we land here vs the paper’s 82.1% top-1 on ImageNet-1K is the same model getting different scores on different datasets, not a code-correctness issue.

9.2 MLIR: Layer Scale

Layer scale is a per-channel learnable diagonal, \(\mathrm{layerScale}(\gamma , x) = \gamma \odot x\), the simplest operator in this series. Its Jacobian is diagonal, so it is its own adjoint: layerScale_has_vjp proves \(\mathrm{back}(x, dy) = \gamma \odot dy\), the backward multiplying by the same \(\gamma \) the forward did. There is nothing to bridge past that identity: the emitted graph is two multiply ops against one tensor:

// ConvNeXt block tail:  out = x + layerScale(gamma, project(...))
%ls  = stablehlo.multiply %gls, %pr   : tensor<1x2x4x4xf32> // gamma (*) pr
%out = stablehlo.add %x, %ls          : tensor<1x2x4x4xf32> // + identity skip
// backward, cotangent %dOut:  d(pr) = gamma (*) d(out)
%dpr = stablehlo.multiply %gls, %dOut : tensor<1x2x4x4xf32>

The forward multiply %gls, %pr and the backward multiply %gls, %dOut are the same op against the same \(\gamma \) tensor: a diagonal map is its own transpose, so its VJP is itself. The identity skip is the residual fan-in of Chapter 6; the surrounding GELU, LayerNorm, and convolution backwards reuse their bridges and, uniquely in this book, every one is unconditional (the only hypothesis anywhere is LayerNorm’s \(\epsilon > 0\)). The full block is \(\mathrm{residual}(\mathrm{layerScale} \circ \mathrm{project} \circ \mathrm{gelu} \circ \mathrm{expand} \circ \mathrm{LN} \circ \mathrm{depthwise})\), composed by convNextBlock_has_vjp.
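The self-adjoint claim can be checked numerically with the defining identity of a VJP, \(\langle J x, dy \rangle = \langle x, J^{\top } dy \rangle \). The sketch below is a hedged illustration on a flattened vector with one \(\gamma \) per entry, which sidesteps the broadcast over spatial positions that the real tensor has. The function names and test values are assumptions and are not taken from the emitted MLIR.

```python
def layer_scale(gamma, x):
    """Forward: elementwise multiply, gamma (*) x."""
    return [g * v for g, v in zip(gamma, x)]

def layer_scale_vjp(gamma, dy):
    """Backward with respect to x: the same multiply, applied to the cotangent."""
    return layer_scale(gamma, dy)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

gamma = [1e-6, 0.5, -2.0, 3.0]
x = [0.4, -1.0, 2.5, 0.7]
dy = [1.0, 0.2, -0.3, 0.9]

lhs = dot(layer_scale(gamma, x), dy)        # <J x, dy>
rhs = dot(x, layer_scale_vjp(gamma, dy))    # <x, J^T dy>
print(lhs, rhs)

# The gradient with respect to gamma is the other elementwise product, x (*) dy.
d_gamma = [v * d for v, d in zip(x, dy)]
```

The two inner products agree up to floating-point rounding because the forward and backward maps are the same diagonal matrix.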

9.3 Data Augmentation

ConvNeXt’s headline accuracy depends on a heavy augmentation pack inherited from DeiT: Mixup (Zhang et al. 2017, arXiv:1710.09412), CutMix (Yun et al. 2019, arXiv:1905.04899), Random Erasing (Zhong et al. 2017, arXiv:1708.04896), and RandAugment (Cubuk et al. 2019, arXiv:1909.13719). All four are implemented as data-only kernels that mutate the input tensor before the graph sees it — tier 1 in our codegen scope, no MLIR plumbing required. Mixup and CutMix route through a soft-label train-step variant since they produce fractional labels; Random Erasing and RandAugment leave labels alone.
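To show why Mixup and CutMix need the soft-label path, here is a minimal sketch of both on toy single-channel images stored as nested lists. It illustrates the technique rather than the book's kernels: the helper names are hypothetical, the CutMix mixing coefficient is drawn from Beta(1, 1) as an assumption, and the Mixup coefficient uses the Beta(0.8, 0.8) quoted in the Mixup discussion of this section.

```python
import random

def mix_labels(label_a, label_b, lam, num_classes):
    """Soft target: lam on class a, (1 - lam) on class b."""
    target = [0.0] * num_classes
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    return target

def mixup(img_a, img_b, lam):
    """Blend two H x W images pixel by pixel."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(img_a, img_b)]

def cutmix(img_a, img_b, lam, rng):
    """Paste a random box from img_b into img_a; return image and realized lam."""
    h, w = len(img_a), len(img_a[0])
    ratio = (1.0 - lam) ** 0.5                 # box side ratio so area ~ (1 - lam)
    bh, bw = int(h * ratio), int(w * ratio)
    cy, cx = rng.randrange(h), rng.randrange(w)
    y0, y1 = max(cy - bh // 2, 0), min(cy + bh // 2, h)
    x0, x1 = max(cx - bw // 2, 0), min(cx + bw // 2, w)
    out = [row[:] for row in img_a]
    for y in range(y0, y1):
        out[y][x0:x1] = img_b[y][x0:x1]
    # Clipping at the border changes the box area, so recompute lam from it.
    lam_real = 1.0 - (y1 - y0) * (x1 - x0) / (h * w)
    return out, lam_real

rng = random.Random(0)
H = W = 8
img_a = [[0.0] * W for _ in range(H)]   # all-zero image, class 3
img_b = [[1.0] * W for _ in range(H)]   # all-one image, class 7

lam_cut = rng.betavariate(1.0, 1.0)     # assumed CutMix alpha
mixed, lam_real = cutmix(img_a, img_b, lam_cut, rng)
cut_target = mix_labels(3, 7, lam_real, num_classes=10)

lam_mix = rng.betavariate(0.8, 0.8)
blended = mixup(img_a, img_b, lam_mix)
mix_target = mix_labels(3, 7, lam_mix, num_classes=10)
```

Both targets are fractional distributions over two classes, which is why a hard-label cross-entropy step cannot consume them. Random Erasing and RandAugment only touch pixels, so the one-hot label survives unchanged.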

Holding architecture and base optimizer fixed, we can measure the marginal effect of each augmentation knob layered on top of the bare config, on the same Imagenette data and 80-epoch budget.

Cell                               Val acc   \(\Delta \) vs bare
convnext-tiny-gelu-cutmix          87.81%    \(+2.9\)
convnext-tiny-gelu-erase           85.63%    \(+0.7\)
convnext-tiny-gelu-randaug (M=9)   85.48%    \(+0.5\)
convnext-tiny-gelu (bare)          84.94%
convnext-tiny-gelu-mixup           83.45%    \(-1.5\)

  • CutMix is the load-bearing knob. \(+2.9\% \) over bare, a single config change. The ViT-Tiny ablation in Ch 10 reaches the same conclusion at a different architecture: CutMix alone is the biggest single lift.

  • Random Erasing and RandAugment at M=9 are in the same tier. \(+0.7\% \) and \(+0.5\% \) respectively; both at the edge of seed noise. M=9 is too aggressive for our 9.5K-image scale (the paper trained on 1.2M images), and erasing 25% of pixels is in the same noise band.

  • Mixup actively hurts at this scale. \(-1.5\% \) below bare. The blended-label gradient signal is too aggressive for a 9.5K-image training set (\(\sim \)950 images per class over 10 classes); the model can’t extract a clean target from a Beta(0.8, 0.8) mix of two images when it has so few exemplars to learn from. A Mixup ablation on full ImageNet (1.28M images, \(\sim \)1280 per class) typically lifts \(+0.5\) to \(+1.0\); here we’re over 100\(\times \) below that total data scale.

One seed per cell. The \(+0.5\% \) and \(+0.7\% \) deltas on RandAugment and Random Erasing are within noise; the \(+2.9\% \) delta from CutMix and the \(-1.5\% \) delta from Mixup are well above. The cross-architecture confirmation with Ch 10’s ViT-Tiny ablation (CutMix \(\gg \) RandAug at the same data scale) suggests the ranking is data-regime-driven, not architecture-driven.

9.4 ImageNet recipe

Why the phase-2 trainer? The ImageNet runs in this book use the phase-2 (Lean\(\to \)JAX) trainer: at this scale its job is to validate the framework’s logic end to end. Whether the phase-3 verified-IREE codegen can reach ImageNet under its own codegen rules is an open question — phase-2 is how we establish these baselines in the meantime.

ConvNeXt’s architecture first reached this book through the phase-3 IREE backend — the Imagenette example earlier in this chapter. The phase-2 port wires the two ConvNeXt-specific layers (.convNextStage, .convNextDownsample) into the JAX codegen; because that path differentiates with jax.value_and_grad, only the forward had to be written — the backward comes for free. The spec mirrors jax/MainConvNeXtImagenet.lean:

-- Same stage/downsample backbone, 1000 classes.
def convNeXtTinyImagenet : NetSpec where
  name   := "ConvNeXt-T (ImageNet, bf16)"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 96 4 4 .same,            -- patchify stem -> 56x56
    .convNextStage 96 3 .ln .gelu,     -- stage 1: 3 blocks @ 96
    .convNextDownsample 96 192,
    .convNextStage 192 3 .ln .gelu,    -- stage 2: 3 blocks @ 192
    .convNextDownsample 192 384,
    .convNextStage 384 9 .ln .gelu,    -- stage 3: 9 blocks @ 384
    .convNextDownsample 384 768,
    .convNextStage 768 3 .ln .gelu,    -- stage 4: 3 blocks @ 768
    .globalAvgPool,
    .dense 768 1000 .identity          -- 1000-class head
  ]

-- ConvNeXt needs AdamW, not SGD.
def convNeXtTinyImagenetConfig : TrainConfig where
  learningRate   := 4e-4            -- paper uses 4e-3 @ batch 4096; reduced for batch 256 (strict linear scaling would give 2.5e-4)
  batchSize      := 256
  epochs         := 80             -- validation tier of an 80->300 ladder
  useAdam        := true            -- AdamW (decoupled weight decay)
  weightDecay    := 0.05
  cosineDecay    := true
  warmupEpochs   := 5
  gradClipNorm   := 1.0             -- cheap insurance (unlocked the ViT run)
  augment        := true
  labelSmoothing := 0.1
  bf16           := true
  bf16Conv       := true            -- dw-7x7 + 1x1s; LN/GELU stay fp32
  useEMA         := true            -- weight averaging, decay 0.9999
  dropPath       := 0.1             -- stochastic depth (ConvNeXt-T paper value)

Unlike MobileNet and EfficientNet, ConvNeXt trains with AdamW and a small decoupled weight decay, not SGD — the modernized-convnet recipe is the architecture’s whole thesis. bf16Conv pays off especially here: the \(7\times 7\) depthwise is \(\sim 2.3\times \) faster in bfloat16 (a large kernel has enough arithmetic for the tensor cores to bite, unlike the \(3\times 3\) depthwise, which is a wash), and the inverted-bottleneck \(1\times 1\)s are matmuls that love it; the channel LayerNorm and GELU stay in fp32.

The two regularizers core to ConvNeXt’s 300-epoch recipe, stochastic depth (drop-path 0.1, the paper value) and EMA (weight averaging, decay 0.9999), are now wired and on (dropPath and useEMA above), and cost \(<5\% \) per epoch. One deviation from the canonical architecture remains, deliberate so the port reuses the validated layer types: the patchify stem is .convBn (BN \(+\) a ReLU) rather than conv \(+\) LayerNorm. It is param-equivalent, and the ReLU is a minor variant. The augmentation pack above (Mixup, CutMix, RandAugment, Random Erasing) is wired into the phase-2 Lean\(\to \)JAX trainer, each knob gated on a config flag, with RandAugment as a color-only “lite” variant (the geometric ops are not yet wired). It is left off here, so an 80-epoch run lands under the paper’s \(82.1\% \); the 80 here is the validation tier of an 80\(\to \)300 ladder.

Compute budget. ConvNeXt-T is the heaviest of the three (\(28.6\)M parameters, the canonical count, verified by an init/forward/backward/save/reload round-trip). Run on all six RTX 4060 Ti cards (CUDA, bf16, batch 252 = 6\(\times \)42), steady-state throughput is \(\sim \)143 ms per step, about \(12.6\) minutes per epoch. That is the most expensive per-epoch cost of any net in this book, driven by the \(7\times 7\) depthwise and the channel LayerNorms. Unlike the lighter EfficientNet, ConvNeXt gets a healthy \(\sim \)1.27\(\times \) from four to six GPUs (it ran \(\sim \)185 ms/step on four): being compute-bound, it keeps the cards near \(100\% \) utilization rather than starving on the host data pipeline, so more of the added GPU capacity turns into throughput.

GPU                    Epochs   Per epoch           Wall-clock         Val top-1               Val top-5
6\(\times \) 4060 Ti   80       \(\sim \)12.6 min   \(\sim \)15.5 hr   \(\mathbf{75.93\% }\)   \(\mathbf{92.27\% }\)
6\(\times \) 4060 Ti   300      \(\sim \)12.6 min   \(\sim \)63 hr     pending                 pending

(CUDA, bf16. The 300-epoch row is the projected wall-clock; accuracy pending.) The 80-epoch run reached \(\mathbf{75.93\% }\) top-1 / \(\mathbf{92.27\% }\) top-5 on the full 50,000-image validation split, the strongest result in this book’s sweep, ahead of EfficientNet-B0 (\(72.3\% \)) and ResNet-34 (\(72.0\% \)). The modern recipe earns it: AdamW, LayerScale, stochastic depth, and EMA are all in. The remaining gap to ConvNeXt-T’s \(\sim \)82% headline is the augmentation pack, which is left off here and whose RandAugment still lacks geometric ops, plus the 300-epoch schedule. The validation curve follows; note the slow start, an artifact of the EMA shadow (decay \(0.9999\)) lagging the live weights for the first few epochs before catching up:

\begin{tikzpicture} 
\begin{axis}[
    width=0.92\linewidth, height=6.5cm,
    xlabel={Epoch}, ylabel={Validation accuracy (\%)},
    xmin=0, xmax=81, ymin=0, ymax=95,
    xtick={0,10,20,30,40,50,60,70,80}, ytick={20,40,60,80},
    legend pos=south east, legend cell align={left},
    grid=major, grid style={gray!18},
    tick label style={font=\small}, label style={font=\small},
    every axis plot/.append style={line width=1pt, mark size=1pt},
]
\addplot[blue, mark=*, mark options={fill=blue}] coordinates {
(2,1.43) (3,2.85) (4,4.05) (5,11.20) (7,33.64) (8,41.50) (9,48.69) (10,55.26) (11,60.44) (12,63.92) (13,66.44) (14,68.02) (16,70.20) (17,70.97) (18,71.58) (19,72.11) (21,72.94) (22,73.23) (23,73.53) (24,73.75) (26,74.31) (27,74.41) (28,74.65) (29,74.80) (31,75.05) (32,75.15) (33,75.22) (34,75.47) (36,75.63) (37,75.63) (38,75.73) (39,75.86) (41,76.02) (42,76.03) (43,76.06) (44,76.02) (46,76.11) (47,76.15) (48,76.28) (49,76.23) (51,76.27) (52,76.22) (53,76.23) (54,76.20) (56,76.23) (57,76.18) (58,76.14) (59,76.12) (60,76.13) (61,76.07) (62,76.03) (63,75.99) (66,75.98) (67,75.92) (68,76.02) (69,75.99) (70,76.03) (71,76.00) (79,75.94) (80,75.93)
};
\addlegendentry{top-1}
\addplot[orange, mark=*, mark options={fill=orange}] coordinates {
(2,5.18) (3,8.68) (4,11.37) (5,26.99) (7,58.74) (8,67.14) (9,73.59) (10,78.95) (11,82.81) (12,85.21) (13,86.82) (14,87.94) (16,89.25) (17,89.65) (18,90.11) (19,90.44) (21,91.01) (22,91.27) (23,91.42) (24,91.57) (26,91.86) (27,91.94) (28,91.98) (29,92.08) (31,92.26) (32,92.32) (33,92.38) (34,92.34) (36,92.43) (37,92.51) (38,92.54) (39,92.58) (41,92.64) (42,92.61) (43,92.66) (44,92.67) (46,92.61) (47,92.62) (48,92.68) (49,92.62) (51,92.62) (52,92.64) (53,92.67) (54,92.66) (56,92.58) (57,92.59) (58,92.54) (59,92.57) (60,92.56) (61,92.52) (62,92.47) (63,92.43) (66,92.44) (67,92.43) (68,92.43) (69,92.36) (70,92.38) (71,92.35) (79,92.24) (80,92.24)
};
\addlegendentry{top-5}
\end{axis}
\end{tikzpicture}

ConvNeXt-T / ImageNet-1k validation accuracy per epoch (bf16, 6\(\times \) 4060 Ti).
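The slow start in the curve is consistent with simple EMA arithmetic. The sketch below assumes the plain update shadow \(\leftarrow d \cdot \) shadow \(+ (1 - d) \cdot \) weights with a constant decay \(d = 0.9999\), a shadow initialized from the initial weights, and no decay warmup or bias correction; the trainer's actual EMA variant is not shown in this chapter. The steps-per-epoch figure is derived from the reported \(\sim \)12.6 minutes per epoch at \(\sim \)143 ms per step, so it is approximate.

```python
decay = 0.9999
steps_per_epoch = round(12.6 * 60 / 0.143)   # ~12.6 min/epoch at ~143 ms/step

def init_weight_remaining(epochs):
    """Share of the EMA shadow still owed to its initial value after `epochs`."""
    return decay ** (epochs * steps_per_epoch)

for epochs in (1, 2, 5, 10):
    print(epochs, round(init_weight_remaining(epochs), 3))
```

Under these assumptions more than half of the shadow is still the random initialization after one epoch, and the share only drops below a tenth around epoch 5. That is roughly where the plotted top-1 leaves single digits, so evaluating the shadow rather than the live weights would explain the delayed takeoff.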