Verified Deep Learning with Lean 4

8 EfficientNet

A convolution treats every channel equally

By the time a feature map reaches a deep stage of a CNN, each channel is a learned “feature detector” for something specific—a headlight, a curve, a fur texture, a leaf edge. A standard \(3 \times 3\) convolution that consumes this feature map treats every one of those channels as equally important contributors to every output channel. Its kernel learns a fixed mixing recipe that’s applied to every spatial position of every input image, regardless of what’s actually in the image.

That’s wasteful in a particular way. A “headlight” channel is useful when there’s a car in the image and useless on a beach. A “leaf-edge” channel is useful in a tree-heavy region and useless in the sky. The network has no built-in way to say “for this image, dial channel 47 up and channel 112 down.” Hu et al. 2018 (arXiv:1709.01507) designed a tiny module that lets the network do exactly that: look at the global content of the feature map, produce a per-channel gain, and rescale every channel by its gain before passing the feature map forward.

That module is the Squeeze-and-Excitation block, and at its core it is a learned per-channel attention mechanism that predates the transformer-era “attention” by four years and informs every post-2018 vision architecture.

Three steps, one tensor product

An SE block on a feature map \(x \in \mathbb {R}^{B \times C \times H \times W}\) runs three operations in sequence:

  1. Squeeze. Global-average-pool across spatial dimensions: \(s_c = \frac{1}{HW} \sum _{i,j} x_{c, i, j}\). Result is a \(B \times C\) tensor—one scalar per image per channel, summarizing the channel’s overall activity.

  2. Excite. Push \(s\) through a tiny two-layer MLP with a bottleneck: \(g = \sigma (W_2\, \mathrm{ReLU}(W_1 s))\), where \(W_1\) is \(C \to C/r\) and \(W_2\) is \(C/r \to C\) for a reduction ratio \(r\) (typically 16). The sigmoid squashes each output to \([0, 1]\). Result: \(g \in \mathbb {R}^{B \times C}\), a per-channel gate value.

  3. Apply. Multiply the original feature map by the gates, broadcast across the spatial axes: \(y_{c, i, j} = g_c \cdot x_{c, i, j}\). Channels with gate near 1 pass through; channels with gate near 0 are suppressed.

That’s it. The MLP operates on a \([B, C]\) tensor, not \([B, C, H, W]\), so it’s tiny—\(C^2 / r + C/r + C\) parameters per block, vs millions for a standard conv at the same width.

From the framework: it’s an elementwise product

What is the SE block, as a function? Up to a learned scalar factor per channel, it’s just an elementwise product of the input with a side branch:

\[ \mathrm{se}(x) \; =\; x \; \odot \; g(\mathrm{gap}(x)), \]

where the gate \(g\) is built from the saved input \(x\) via two dense layers and a sigmoid. We already proved the VJP of an elementwise product (Chapter 2). We already proved the VJP of dense (Chapter 3). We already proved the VJP of identity, and an additive fan-in collapses to the same product-rule structure as elementwise. Theorem 40 is one line of composition; it introduces no new analytic dependencies.

This is attention, in disguise, before transformers

The mechanism is worth naming clearly: SE looks at the entire feature map, produces a content-dependent re-weighting, and applies it back to the data. That is a learned attention operation in the standard sense. The reduction is over the spatial axes (so each token-equivalent is a channel, not a patch), but the shape is the same: a query (GAP summary) attending over a collection (the channels) to produce weights, weights gating the original signal.

The transformer’s self-attention (Vaswani et al. 2017) generalizes the same pattern from a fixed query (the GAP) to a learned sequence of queries (the token positions) and from a fixed key/value (the channels) to a learned key/value (per-token features). SE is the bare minimum version of that idea. Every post-2018 vision architecture has SE or one of its descendants (CBAM, ECA, GE) bolted in, because the accuracy-per-parameter cost is low and the operation composes cleanly with anything else.

EfficientNet-B0: MobileNet-V2 with SE everywhere

EfficientNet-B0 (Tan & Le 2019, arXiv:1905.11946) is, at the architectural level, MobileNet-V2 with two changes:

  1. Every .mbConv block contains an SE sub-module after the depthwise conv. That’s the useSE = true flag in the spec.

  2. Swish (\(x \cdot \sigma (x)\)) replaces ReLU as the activation. Smoother gradient, modest empirical lift, no new VJP structure— just chain rule through a sigmoid.

The paper’s actual thesis is compound scaling: jointly scale depth, width, and input resolution by a single coefficient \(\phi \) and you get the EfficientNet family B0 through B7, with B7 reaching ImageNet state-of-the-art at the time. At the B0 baseline (the one in this book), the architectural lift over MobileNet-V2 is essentially SE plus Swish; the compound scaling story lives in the B0\(\to \)B7 progression that we don’t fit on Imagenette because the dataset is too small for the return-on-scaling to show up.

The theorem

Theorem 40 SE block VJP
#

SE multiplies input by a sigmoid-gated channel mask.

Proof

Mechanical; see Proofs.seBlock_has_vjp.

8.1 Example: EfficientNet-B0 on Imagenette

Squeeze-and-Excitation by itself is one of the smallest architectural contributions in deep learning. Take a feature map, compute a per-channel scalar by global-average-pooling it, pass those scalars through a tiny two-layer MLP with a sigmoid at the end, multiply those sigmoid outputs back into the original feature map per-channel. That’s it. It’s a learned per-channel reweighting — a particularly simple form of attention, predating the transformer-era use of “attention” by four years (Hu et al. 2018, Squeeze-and- Excitation Networks).

EfficientNet (Tan & Le 2019) bolts SE into every .mbConv block in what’s otherwise a MobileNet-V2-shaped architecture. The paper’s actual contribution is compound scaling (tune depth, width, and input resolution together) but the B0 baseline’s architectural novelty over MobileNet V2 is essentially “add SE blocks.”

The architecture

-- 1
import LeanMlir

-- 2
def efficientNetB0 : NetSpec where
  name   := "EfficientNet-B0"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,                          -- stem 224→112
    .mbConv  32  16 1 3 1 1 true,                    -- 112
    .mbConv  16  24 6 3 2 2 true,                    -- 112→56
    .mbConv  24  40 6 5 2 2 true,                    -- 56→28
    .mbConv  40  80 6 3 2 3 true,                    -- 28→14
    .mbConv  80 112 6 5 1 3 true,                    -- 14
    .mbConv 112 192 6 5 2 4 true,                    -- 14→7
    .mbConv 192 320 6 3 1 1 true,                    -- 7
    .convBn 320 1280 1 1 .same,                      -- 1×1 to 1280
    .globalAvgPool,
    .dense 1280 10 .identity
  ]

-- 3
def efficientNetB0Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  efficientNetB0.train efficientNetB0Config
    (args.head?.getD "data/imagenette")

The .mbConv primitive takes (ic, oc, expandRatio, kernel, stride, nBlocks, useSE). Kernel sizes vary by stage (EfficientNet uses a mix of \(3 \times 3\) and \(5 \times 5\) depthwise convs), and useSE = true in every block here turns on the Squeeze-and-Excitation sub-module. Every SE block multiplies the input by the sigmoid output of its two-layer MLP: that’s exactly § 40 composed with § 11 and the already- proved dense VJPs from Chapter 3. No new math in this chapter beyond the SE block itself.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/efficientnet-train
EfficientNet-B0: 7155658 params
Generating train step MLIR...
  938160 chars
Compiling vmfbs...
  forward compiled
  eval forward compiled
  train step compiled
  session loaded
  train: 9469 images (256×256)
  7155658 params + m + v (81 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 49, BN stat floats: 42016
  step 0/295: loss=2.345573 (956ms)
Epoch 10/80: loss=0.947504 lr=0.000985 (280713ms)
  val accuracy (running BN): 3077/3904 = 78.82%
Epoch 20/80: loss=0.711379 lr=0.000897 (280782ms)
  val accuracy (running BN): 3208/3904 = 82.17%
Epoch 40/80: loss=0.548946 lr=0.000551 (280416ms)
  val accuracy (running BN): 3363/3904 = 86.14%
Epoch 60/80: loss=0.512627 lr=0.000173 (281250ms)
  val accuracy (running BN): 3410/3904 = 87.35%
Epoch 80/80: loss=0.503827 lr=0.000000 (281927ms)
  val accuracy (running BN): 3419/3904 = 87.58%

(Full log in logs/effnet_se.log.)

Final val accuracy 87.58% on Imagenette, wall time about 6.2 hours, loss plateaus at the same \(\sim \)0.50 label-smoothing floor as ResNet-34 and MobileNet V2. Three-chapter comparison, same dataset, same training recipe:

Model

Params

MLIR

Step time

Total

Val acc

ResNet-34

21.29M

518 KB

1400 ms

9.5 h

90.29%

MobileNet V2

2.24M

741 KB

830 ms

5.4 h

87.09%

EfficientNet-B0

7.16M

938 KB

940 ms

6.2 h

87.58%

- EfficientNet-B0 is 3.2\(\times \) the parameters of MobileNet V2 for a 0.5 percentage-point accuracy lift (87.58% vs 87.09%). On a pure params-vs-accuracy curve, SE is an expensive addition — the parameter cost comes from the two-layer MLP in every SE sub-block, and every .mbConv in B0 has one. Seven stages \(\times \) 1–4 blocks each = 16+ SE blocks across the network.

- 938 KB of MLIR is the largest yet — more than either ResNet-34 (518 KB) or MobileNet V2 (741 KB). The extra ops come from SE’s per-channel attention: each SE block emits a GAP, two denses, a sigmoid, and an elementwise-multiply. Over 16 blocks that’s non-trivial StableHLO.

- Per-step time is 940 ms, intermediate between R34 and MNv2. SE’s extra ops slow things down vs plain depthwise-separable, but the core is still depthwise-separable so it remains cheaper than ResNet-34’s full convs.

- The lift isn’t from SE alone. EfficientNet’s paper thesis is that compound scaling — tuning depth, width, and resolution jointly with a scaling coefficient — produces the ImageNet state-of-the-art. At the B0 baseline shown here, SE gives a modest bump; the real wins come at B3–B7 where the compound scaling compounds. Our Imagenette example sits at B0 because the dataset is 10-class Imagenette, not 1000-class ImageNet, and the return-on-scaling saturates much earlier.

Every post-2018 vision architecture has some variant of SE in it (MobileNet V3, EfficientNetV2, ConvNeXt, RegNetZ) because the accuracy-per-parameter cost is low when you compound it with other scaling tricks. The other architectural contribution EfficientNet shipped — the Swish activation \(\mathrm{swish}(x) = x \cdot \sigma (x)\) — is orthogonal to SE.

8.2 Side quests

Two EfficientNet follow-ups extended the family in orthogonal directions but neither lands in this book’s spec. Pointers for the curious:

Noisy Student (Xie et al. 2019). Semi-supervised distillation that pushed EfficientNet-L2 to \(88.4\% \) top-1 on ImageNet, a state-of-the-art mark at the time. The recipe: train a teacher on labeled ImageNet, use the teacher to pseudo-label JFT-300M (\(\sim \)300M unlabeled images), then train a larger student on the combined dataset with heavy augmentation (RandAugment, dropout, stochastic depth) — the “noise” the title refers to. Iterate.

What’s interesting from this book’s perspective: the architecture doesn’t change. Every Noisy Student model is still EfficientNet-B\(k\) or -L\(k\). The lift comes entirely from the data plus training recipe, not new ops. Our enetB0SwishSpec could be the teacher in a Noisy Student loop unchanged — the missing pieces are the unlabeled corpus and the iteration harness, not the proof side.

EfficientDet (Tan et al. 2020). EfficientNet backbone plus BiFPN (bidirectional feature pyramid) for object detection. BiFPN aggregates multi-scale features by fusing each spatial resolution with both its coarser and finer neighbors via weighted-sum cross-scale connections. The detection head is a small per-anchor MLP predicting class plus box coordinates, applied at every BiFPN output level.

What it would take to add: BiFPN is a few rounds of convBn plus resize-to-match-spatial-resolution plus weighted add. The cross-scale resize is the only new primitive — nearest-neighbor or bilinear upsample / downsample. Every other component (conv, BN, dense head) is already proved. The detection-side loss (focal loss plus Smooth-L1 box regression) is also new but it sits at the leaves of the backward graph; it doesn’t touch the proofs threading through the architecture.

These two side quests sit on the bestiary edge: the underlying VJP proofs in Proofs/ would hold without modification; what’s missing is the surrounding scaffolding — a data pipeline for Noisy Student, a multi-scale head plus detection loss for EfficientDet.