Verified Deep Learning with Lean 4
Formal Backpropagation from MLP to Attention, via MLIR

8 EfficientNet

A convolution treats every channel equally

By the time a feature map reaches a deep stage of a CNN, each channel is a learned “feature detector” for something specific—a headlight, a curve, a fur texture, a leaf edge. A standard \(3 \times 3\) convolution that consumes this feature map treats every one of those channels as equally important contributors to every output channel. Its kernel learns a fixed mixing recipe that’s applied to every spatial position of every input image, regardless of what’s actually in the image.

That’s wasteful in a particular way. A “headlight” channel is useful when there’s a car in the image and useless on a beach. A “leaf-edge” channel is useful in a tree-heavy region and useless in the sky. The network has no built-in way to say “for this image, dial channel 47 up and channel 112 down.” Hu et al. 2018 (arXiv:1709.01507) designed a tiny module that lets the network do exactly that: look at the global content of the feature map, produce a per-channel gain, and rescale every channel by its gain before passing the feature map forward.

That module is the Squeeze-and-Excitation block, and at its core it is a learned per-channel attention mechanism that predates the transformer-era “attention” by four years and informs every post-2018 vision architecture.

Three steps, one tensor product

An SE block on a feature map \(x \in \mathbb {R}^{B \times C \times H \times W}\) runs three operations in sequence:

  1. Squeeze. Global-average-pool across spatial dimensions: \(s_c = \frac{1}{HW} \sum _{i,j} x_{c, i, j}\). Result is a \(B \times C\) tensor—one scalar per image per channel, summarizing the channel’s overall activity.

  2. Excite. Push \(s\) through a tiny two-layer MLP with a bottleneck: \(g = \sigma (W_2\, \mathrm{ReLU}(W_1 s))\), where \(W_1\) is \(C \to C/r\) and \(W_2\) is \(C/r \to C\) for a reduction ratio \(r\) (typically 16). The sigmoid squashes each output to \([0, 1]\). Result: \(g \in \mathbb {R}^{B \times C}\), a per-channel gate value.

  3. Apply. Multiply the original feature map by the gates, broadcast across the spatial axes: \(y_{c, i, j} = g_c \cdot x_{c, i, j}\). Channels with gate near 1 pass through; channels with gate near 0 are suppressed.

That’s it. The MLP operates on a \([B, C]\) tensor, not \([B, C, H, W]\), so it’s tiny—\(C^2 / r + C/r + C\) parameters per block, vs millions for a standard conv at the same width.

From the framework: it’s an elementwise product

What is the SE block, as a function? Up to a learned scalar factor per channel, it’s just an elementwise product of the input with a side branch:

\[ \mathrm{se}(x) \; =\; x \; \odot \; g(\mathrm{gap}(x)), \]

where the gate \(g\) is built from the saved input \(x\) via two dense layers and a sigmoid. We already proved the VJP of an elementwise product (Chapter 2). We already proved the VJP of dense (Chapter 3). We already proved the VJP of identity, and an additive fan-in collapses to the same product-rule structure as elementwise. Theorem 40 is one line of composition; it introduces no new analytic dependencies.

Attention in Disguise

The mechanism is worth naming clearly: SE looks at the entire feature map, produces a content-dependent re-weighting, and applies it back to the data. That is a learned attention operation in the standard sense. The reduction is over the spatial axes (so each token-equivalent is a channel, not a patch), but the shape is the same: a query (GAP summary) attending over a collection (the channels) to produce weights, weights gating the original signal.

The transformer’s self-attention (Vaswani et al. 2017) generalizes the same pattern from a fixed query (the GAP) to a learned sequence of queries (the token positions) and from a fixed key/value (the channels) to a learned key/value (per-token features). SE is the bare minimum version of that idea. Every post-2018 vision architecture has SE or one of its descendants (CBAM, ECA, GE) bolted in, because the accuracy-per-parameter cost is low and the operation composes cleanly with anything else.

EfficientNet-B0: MobileNet-V2 with SE everywhere

EfficientNet-B0 (Tan & Le 2019, arXiv:1905.11946) is, at the architectural level, MobileNet-V2 with two changes:

  1. Every .mbConv block contains an SE sub-module after the depthwise conv. That’s the useSE = true flag in the spec.

  2. Swish (\(x \cdot \sigma (x)\)) replaces ReLU as the activation. Smoother gradient, modest empirical lift, no new VJP structure— just chain rule through a sigmoid.

The paper’s actual thesis is compound scaling: jointly scale depth, width, and input resolution by a single coefficient \(\phi \) and you get the EfficientNet family B0 through B7, with B7 reaching ImageNet state-of-the-art at the time. At the B0 baseline (the one in this book), the architectural lift over MobileNet-V2 is essentially SE plus Swish; the compound scaling story lives in the B0\(\to \)B7 progression that we don’t fit on Imagenette because the dataset is too small for the return-on-scaling to show up.

The theorem

Theorem 40 SE block VJP
#

SE multiplies input by a sigmoid-gated channel mask.

Proof

Mechanical; see Proofs.seBlock_has_vjp.

8.1 Example: EfficientNet-B0 on Imagenette

Squeeze-and-Excitation by itself is one of the smallest architectural contributions in deep learning. Take a feature map, compute a per-channel scalar by global-average-pooling it, pass those scalars through a tiny two-layer MLP with a sigmoid at the end, multiply those sigmoid outputs back into the original feature map per-channel. That’s it. It’s a learned per-channel reweighting — a particularly simple form of attention, predating the transformer-era use of “attention” by four years (Hu et al. 2018, Squeeze-and- Excitation Networks).

EfficientNet (Tan & Le 2019) bolts SE into every .mbConv block in what’s otherwise a MobileNet-V2-shaped architecture. The paper’s actual contribution is compound scaling (tune depth, width, and input resolution together) but the B0 baseline’s architectural novelty over MobileNet V2 is essentially “add SE blocks.”

The architecture

-- 1
import LeanMlir

-- 2
def efficientNetB0 : NetSpec where
  name   := "EfficientNet-B0"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,                          -- stem 224→112
    .mbConv  32  16 1 3 1 1 true,                    -- 112
    .mbConv  16  24 6 3 2 2 true,                    -- 112→56
    .mbConv  24  40 6 5 2 2 true,                    -- 56→28
    .mbConv  40  80 6 3 2 3 true,                    -- 28→14
    .mbConv  80 112 6 5 1 3 true,                    -- 14
    .mbConv 112 192 6 5 2 4 true,                    -- 14→7
    .mbConv 192 320 6 3 1 1 true,                    -- 7
    .convBn 320 1280 1 1 .same,                      -- 1×1 to 1280
    .globalAvgPool,
    .dense 1280 10 .identity
  ]

-- 3
def efficientNetB0Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  efficientNetB0.train efficientNetB0Config
    (args.head?.getD "data/imagenette")

The .mbConv primitive takes (ic, oc, expandRatio, kernel, stride, nBlocks, useSE). Kernel sizes vary by stage (EfficientNet uses a mix of \(3 \times 3\) and \(5 \times 5\) depthwise convs), and useSE = true in every block here turns on the Squeeze-and-Excitation sub-module. Every SE block multiplies the input by the sigmoid output of its two-layer MLP: that’s exactly § 40 composed with § 11 and the already- proved dense VJPs from Chapter 3. No new math in this chapter beyond the SE block itself.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/efficientnet-train
EfficientNet-B0: 7155658 params
Generating train step MLIR...
  938160 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images (256×256)
  7155658 params + m + v (81 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 49, BN stat floats: 42016
  step 0/295: loss=2.345573 (956ms)
Epoch 10/80: loss=0.947504 lr=0.000985 (280713ms)
  val accuracy (running BN): 3077/3904 = 78.82%
Epoch 20/80: loss=0.711379 lr=0.000897 (280782ms)
  val accuracy (running BN): 3208/3904 = 82.17%
Epoch 40/80: loss=0.548946 lr=0.000551 (280416ms)
  val accuracy (running BN): 3363/3904 = 86.14%
Epoch 60/80: loss=0.512627 lr=0.000173 (281250ms)
  val accuracy (running BN): 3410/3904 = 87.35%
Epoch 80/80: loss=0.503827 lr=0.000000 (281927ms)
  val accuracy (running BN): 3419/3904 = 87.58%

(Full log in logs/effnet_se.log.)

Final val accuracy 87.58% on Imagenette, wall time about 6.2 hours, loss plateaus at the same \(\sim \)0.50 label-smoothing floor as ResNet-34 and MobileNet V2. Three-chapter comparison, same dataset, same training recipe:

Model

Params

MLIR

Step time

Total

Val acc

ResNet-34

21.29M

518 KB

1400 ms

9.5 h

90.29%

MobileNet V2

2.24M

741 KB

830 ms

5.4 h

87.09%

EfficientNet-B0

7.16M

938 KB

940 ms

6.2 h

87.58%

- EfficientNet-B0 is 3.2\(\times \) the parameters of MobileNet V2 for a 0.5 percentage-point accuracy lift (87.58% vs 87.09%). On a pure params-vs-accuracy curve, SE is an expensive addition — the parameter cost comes from the two-layer MLP in every SE sub-block, and every .mbConv in B0 has one. Seven stages \(\times \) 1–4 blocks each = 16+ SE blocks across the network.

- 938 KB of MLIR is the largest yet — more than either ResNet-34 (518 KB) or MobileNet V2 (741 KB). The extra ops come from SE’s per-channel attention: each SE block emits a GAP, two denses, a sigmoid, and an elementwise-multiply. Over 16 blocks that’s non-trivial StableHLO.

- Per-step time is 940 ms, intermediate between R34 and MNv2. SE’s extra ops slow things down vs plain depthwise-separable, but the core is still depthwise-separable so it remains cheaper than ResNet-34’s full convs.

- The lift isn’t from SE alone. EfficientNet’s paper thesis is that compound scaling — tuning depth, width, and resolution jointly with a scaling coefficient — produces the ImageNet state-of-the-art. At the B0 baseline shown here, SE gives a modest bump; the real wins come at B3–B7 where the compound scaling compounds. Our Imagenette example sits at B0 because the dataset is 10-class Imagenette, not 1000-class ImageNet, and the return-on-scaling saturates much earlier.

Every post-2018 vision architecture has some variant of SE in it (MobileNet V3, EfficientNetV2, ConvNeXt, RegNetZ) because the accuracy-per-parameter cost is low when you compound it with other scaling tricks. The other architectural contribution EfficientNet shipped — the Swish activation \(\mathrm{swish}(x) = x \cdot \sigma (x)\) — is orthogonal to SE.

8.2 MLIR: Squeeze-and-Excitation

What is already proven. Squeeze-excitation is an elementwise product \(\mathrm{se}(x) = x \odot g(x)\), where the per-channel gate \(g(x)\) is itself a small sub-network of the same input \(x\) — global-average-pool, dense, swish, dense, sigmoid, broadcast. Because the gate depends on \(x\), the VJP is a product rule, not a plain scaling: seBlock_has_vjp proves

\[ dx = g(x)\odot dy \; +\; g.\mathrm{back}\bigl(x \odot dy\bigr), \]

the gate-weighted cotangent plus the gate’s own backward applied to the co-input \(x\odot dy\). seBlockFull_has_vjp supplies the concrete gate (the GAP–dense–swish–dense–sigmoid chain) and mbconvBody_has_vjp composes the whole MBConv.

The gap, and how we close it. The emitted graph realizes seBlock_has_vjp’s product-rule backward in full — both terms — each piece of the gate carrying the swish, sigmoid, and dense bridges of the previous chapters. Here is the fan-in, with the excite sub-network’s internal VJP elided to its one comment line (the SE block runs on the MBConv’s four mid-channels):

// se = x * broadcast(gate(x)); gate = squeeze-excite sub-network
// %dse = cotangent into the SE block;  %ggb2 = broadcast(gate)
%gdleft = stablehlo.multiply %ggb2, %dse : tensor<1x4x4x4xf32>
%gxdse = stablehlo.multiply %d_s4, %dse : tensor<1x4x4x4xf32>
%gdgate = stablehlo.reduce(%gxdse init: %sc)
            applies stablehlo.add across dimensions = [2, 3]
            : (tensor<1x4x4x4xf32>, tensor<f32>) -> tensor<1x4xf32>
// ... gate VJP through the excite MLP (sigmoid/dense/swish/dense) ...
%gdgate_sp = stablehlo.broadcast_in_dim %gdsq_d, dims = [0, 1]
            : (tensor<1x4xf32>) -> tensor<1x4x4x4xf32>
%gdds = stablehlo.add %gdleft, %gdgate_sp : tensor<1x4x4x4xf32>

Read it against the product rule: %gdleft is \(g(x)\odot dy\), the term a naive generator would stop at; %gdgate reduces \(x \odot dy\) over the spatial axes — the adjoint of the gate’s broadcast, i.e. the cotangent on the per-channel gate values — which then flows back through the excite MLP and the global-average-pool to %gdgate_sp; and %gdds = %gdleft + %gdgate_sp adds the two. That final add is the product rule: the gradient that passes straight through the channel, plus the gradient that re-tunes the gate, summed.

As with attention, the listing shows the new structural move — the product-rule fan-in — and elides the work behind it: the // gate VJP comment stands for the excite MLP’s own backward (sigmoid, dense, swish, dense), each carried by this chapter’s bridges and composed in seBlockFull_has_vjp.

Caveats.

  • The SE fan-in carries no kink condition — swish and sigmoid are smooth everywhere; the only smooth-point caveats in an MBConv come from its BatchNorms (\(\epsilon {\gt} 0\)).

  • Representative scale (four mid-channels, two in the squeeze).

8.3 ImageNet recipe

Why the phase-2 trainer? The ImageNet runs in this book use the phase-2 (Lean\(\to \)JAX) trainer: at this scale its job is to validate the framework’s logic end to end. Whether the phase-3 verified-IREE codegen can reach ImageNet under its own codegen rules is an open question — phase-2 is how we establish these baselines in the meantime.

As with ResNet-34 (Chapter 6) and MobileNet-V2, the B0 backbone scales to full 1000-class ImageNet unchanged — the MBConv stack stays put, the head widens to 1000, the schedule grows. The spec mirrors jax/MainEfficientNetImagenet.lean:

-- Same MBConv (SE + Swish) backbone, 1000 classes instead of 10.
def efficientNetB0Imagenet : NetSpec where
  name   := "EfficientNet-B0 (ImageNet, bf16)"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,
    .mbConv  32  16 1 3 1 1 true,
    .mbConv  16  24 6 3 2 2 true,
    .mbConv  24  40 6 5 2 2 true,
    .mbConv  40  80 6 3 2 3 true,
    .mbConv  80 112 6 5 1 3 true,
    .mbConv 112 192 6 5 2 4 true,
    .mbConv 192 320 6 3 1 1 true,
    .convBn 320 1280 1 1 .same,
    .globalAvgPool,
    .dense 1280 1000 .identity      -- 1000-class head
  ]

-- SGD recipe (mirrors MobileNet-V2); small weight decay, bf16 conv on.
def efficientNetB0ImagenetConfig : TrainConfig where
  learningRate := 0.1
  batchSize    := 256
  epochs       := 80               -- validation tier of an 80->300 ladder
  useAdam      := false             -- SGD + momentum
  momentum     := 0.9
  weightDecay  := 1e-5              -- small: protects depthwise/SE params
  cosineDecay    := true
  warmupEpochs   := 5
  augment        := true
  labelSmoothing := 0.1
  bf16           := true
  bf16Conv       := true            -- reaches the MBConv expand/dw/project
  useEMA         := true            -- weight averaging, decay 0.9999
  dropPath       := 0.2             -- stochastic depth (B0 drop-connect rate)

Two details are specific to EfficientNet. First, bf16Conv casts the MBConv block’s heavy convolutions — the \(1\times 1\) expand, the depthwise, the \(1\times 1\) project — but deliberately leaves the squeeze-excitation \(1\times 1\)s in fp32: SE acts on a globally-pooled \(1\times 1\) tensor where there is no throughput to win, and its sigmoid gate is precision-sensitive. The payoff is the same \(\sim 2\times \) MBConv-block speedup MobileNet sees, from exactly the convolutions worth casting.

Second — a correctness note worth surfacing — the SE bottleneck is sized off the block’s input channels (\(c/4\)), not the \(6\times \)-larger expanded width. An earlier version of the codegen sized it off the expanded width and inflated B0 from its canonical \(5.3\)M parameters to \(8.4\)M; sizing the squeeze correctly restores the faithful \(\mathbf{5.29}\)M-parameter B0.

Augmentation. Like MobileNet-V2, the run uses base aug only: Inception-style random-resized-crop (sampling 8–100% of the image area at a 3/4–4/3 aspect ratio, resized to \(224\times 224\)) plus a random horizontal flip, with label smoothing \(0.1\) in the loss; validation center-crops. The heavier pack EfficientNet’s headline recipe leans on — Mixup, CutMix, RandAugment, Random Erasing — is wired into the phase-2 Lean\(\to \)JAX trainer (emitted into the generated tf.data pipeline, each knob gated on a config flag, RandAugment as a color-only “lite” variant) but left off at this validation tier.

Compute budget. This was the first run on all six RTX 4060 Ti (CUDA, bf16, batch 252 = 6\(\times \)42), steady-state \(\sim \)90 ms per step, about \(8.0\) minutes per epoch — so the 80-epoch run took \(\sim \)10.5 wall-clock hours. Six GPUs only buy \(\sim \)1.17\(\times \) over four here (the same B0 ran \(\sim \)102 ms/step on four): with the model held at batch 256, the bottleneck moves off the GPUs and onto the host data pipeline, which cannot decode and augment fast enough to keep six cards fed (they sit at \({\sim }65\% \) utilization). It is not an interconnect limit — the gradient all-reduce is a couple milliseconds of a \(\sim \)90 ms step.

GPU

Epochs

Per epoch

Wall-clock

Val top-1

Val top-5

6\(\times \) 4060 Ti

80

\(\sim \)8.0 min

\(\sim \)10.5 hr

\(\mathbf{72.31\% }\)

\(\mathbf{90.31\% }\)

6\(\times \) 4060 Ti

300

\(\sim \)8.0 min

\(\sim \)40 hr

(CUDA, bf16. The 300-epoch row is the projected wall-clock for the full-recipe run; accuracy pending.) The 80-epoch run reached \(\mathbf{72.31\% }\) top-1 / \(\mathbf{90.31\% }\) top-5 on the full 50,000-image validation split — the strongest result in this book’s sweep, edging ResNet-34’s \(72.0\% \) at roughly a quarter of the parameters (\(5.3\)M vs \(21.8\)M). It still sits under EfficientNet’s headline \({\sim }77\% \), which comes from a far longer schedule with RMSProp, AutoAugment (the trainer substitutes RandAugment), stochastic depth, and EMA. Of those, EMA (decay 0.9999) and stochastic depth (drop-connect 0.2) are on here, costing \({{\lt}}5\% \) per epoch; RMSProp, the geometric half of RandAugment [TODO: wire RMSProp + geometric RandAugment], and the 300-epoch length remain the gap to the paper number. The validation curve:

\begin{tikzpicture} 
\begin{axis}[
    width=0.92\linewidth, height=6.5cm,
    xlabel={Epoch}, ylabel={Validation accuracy (\%)},
    xmin=0, xmax=81, ymin=0, ymax=95,
    xtick={0,10,20,30,40,50,60,70,80}, ytick={20,40,60,80},
    legend pos=south east, legend cell align={left},
    grid=major, grid style={gray!18},
    tick label style={font=\small}, label style={font=\small},
    every axis plot/.append style={line width=1pt, mark size=1pt},
]
\addplot[blue, mark=*, mark options={fill=blue}] coordinates {
(2,4.20) (3,16.40) (4,28.61) (5,38.10) (6,44.90) (7,49.72) (12,61.12) (13,62.18) (14,62.93) (15,63.78) (19,66.16) (20,66.60) (21,67.11) (22,67.44) (26,68.45) (27,68.48) (28,68.72) (29,68.98) (34,69.90) (35,70.02) (36,70.16) (37,70.35) (42,71.00) (43,71.03) (44,71.09) (45,71.33) (49,71.64) (50,71.73) (51,71.72) (52,71.79) (57,71.97) (58,72.02) (59,72.04) (60,72.08) (66,72.33) (67,72.36) (68,72.37) (69,72.37) (70,72.40) (71,72.40) (72,72.36) (73,72.39) (74,72.41) (76,72.38) (77,72.44) (78,72.41) (79,72.45) (80,72.43)
};
\addlegendentry{top-1}
\addplot[orange, mark=*, mark options={fill=orange}] coordinates {
(2,12.99) (3,36.42) (4,53.69) (5,64.01) (6,70.51) (7,74.62) (12,83.34) (13,84.14) (14,84.76) (15,85.31) (19,86.76) (20,87.04) (21,87.32) (22,87.54) (26,88.21) (27,88.28) (28,88.41) (29,88.49) (34,89.00) (35,89.05) (36,89.18) (37,89.28) (42,89.64) (43,89.70) (44,89.70) (45,89.75) (49,89.93) (50,89.98) (51,90.07) (52,90.12) (57,90.18) (58,90.17) (59,90.22) (60,90.25) (66,90.26) (67,90.34) (68,90.31) (69,90.32) (70,90.30) (71,90.33) (72,90.34) (73,90.35) (74,90.34) (76,90.36) (77,90.33) (78,90.34) (79,90.34) (80,90.33)
};
\addlegendentry{top-5}
\end{axis}
\end{tikzpicture}

EfficientNet-B0 / ImageNet-1k validation accuracy per epoch (bf16, 6\(\times \) 4060 Ti).

8.4 Side quests

Two EfficientNet follow-ups extended the family in orthogonal directions but neither lands in this book’s spec. Pointers for the curious:

Noisy Student (Xie et al. 2019). Semi-supervised distillation that pushed EfficientNet-L2 to \(88.4\% \) top-1 on ImageNet, a state-of-the-art mark at the time. The recipe: train a teacher on labeled ImageNet, use the teacher to pseudo-label JFT-300M (\(\sim \)300M unlabeled images), then train a larger student on the combined dataset with heavy augmentation (RandAugment, dropout, stochastic depth) — the “noise” the title refers to. Iterate.

What’s interesting from this book’s perspective: the architecture doesn’t change. Every Noisy Student model is still EfficientNet-B\(k\) or -L\(k\). The lift comes entirely from the data plus training recipe, not new ops. Our enetB0SwishSpec could be the teacher in a Noisy Student loop unchanged — the missing pieces are the unlabeled corpus and the iteration harness, not the proof side.

EfficientDet (Tan et al. 2020). EfficientNet backbone plus BiFPN (bidirectional feature pyramid) for object detection. BiFPN aggregates multi-scale features by fusing each spatial resolution with both its coarser and finer neighbors via weighted-sum cross-scale connections. The detection head is a small per-anchor MLP predicting class plus box coordinates, applied at every BiFPN output level.

What it would take to add: BiFPN is a few rounds of convBn plus resize-to-match-spatial-resolution plus weighted add. The cross-scale resize is the only new primitive — nearest-neighbor or bilinear upsample / downsample. Every other component (conv, BN, dense head) is already proved. The detection-side loss (focal loss plus Smooth-L1 box regression) is also new but it sits at the leaves of the backward graph; it doesn’t touch the proofs threading through the architecture.

These two side quests sit on the bestiary edge: the underlying VJP proofs in Proofs/ would hold without modification; what’s missing is the surrounding scaffolding — a data pipeline for Noisy Student, a multi-scale head plus detection loss for EfficientDet.