Verified Deep Learning with Lean 4

7 MobileNetV2

A standard conv does two jobs at once

A standard 3\(\times \)3 convolution from \(C_\text {in}\) channels to \(C_\text {out}\) channels computes, for every output cell:

\[ y_{c_\text {out},\, i,\, j} \; =\; \sum _{c_\text {in},\, k_h,\, k_w} W_{c_\text {out},\, c_\text {in},\, k_h,\, k_w} \cdot x_{c_\text {in},\, i+k_h,\, j+k_w}. \]

That sum is doing two things. The \(k_h, k_w\) part is spatial mixing: it combines a pixel with its neighbors. The \(c_\text {in}\) part is cross-channel mixing: it combines the \(C_\text {in}\) channels into each of \(C_\text {out}\) new channels. Both happen inside the same kernel and at the same cost \(C_\text {in} \times C_\text {out} \times k^2\) weights per layer.

Howard et al. 2017 (MobileNetV1, then V2 in arXiv:1801.04381) asked: what if we factored those two responsibilities and only paid the full cost of one of them? The spatial mixing is locally information-rich and benefits from a real \(3 \times 3\) neighborhood. The channel mixing can be approximated by a much cheaper operation. If we separate them, we get parameter efficiency without losing much expressive power.

Depthwise: one kernel per channel, no cross-channel sum

A depthwise convolution does the spatial half on its own, per channel. Each output channel comes from one input channel, with its own \(k \times k\) kernel:

\[ y_{c,\, i,\, j} \; =\; \sum _{k_h,\, k_w} W_{c,\, k_h,\, k_w} \cdot x_{c,\, i+k_h,\, j+k_w}. \]

Only one summation level (over the kernel positions), not two. No mixing between channels at all—channel \(c\) of the output only ever looks at channel \(c\) of the input. The weight tensor is shape \([C, k, k]\) instead of \([C_\text {out}, C_\text {in}, k, k]\):

\[ \text{depthwise weights: } C \times 3 \times 3 \quad \text{vs}\quad \text{standard weights: } C_\text {out} \times C_\text {in} \times 3 \times 3. \]

For a typical \(C_\text {in} = C_\text {out} = 64\) layer, that’s 576 weights vs 36,864 weights—a 64\(\times \) reduction. The Jacobian from Chapter 4 simplifies in step: one fewer \(\Sigma \) level means the depthwise input-VJP is a slightly simpler reversed-kernel convolution, the weight VJP is a slightly simpler transpose-trick, and the bias VJP is the same per-channel sum. Theorems 37 through 39 prove the depthwise VJPs.

The depthwise-separable sandwich

A depthwise conv alone never lets channels see each other. That’s a problem—the whole point of stacking convolutions is to build hierarchical features that combine information across channels. The fix is to follow each depthwise with a pointwise \(1 \times 1\) convolution, which is purely cross-channel mixing (no spatial structure—each output pixel is a learned linear combination of the same input pixel’s channels). That two-step combination,

\[ \text{depthwise}\ 3 \times 3 \quad \longrightarrow \quad \text{pointwise}\ 1 \times 1, \]

is called a depthwise-separable convolution, and it’s what replaces the standard \(3 \times 3\) conv in MobileNet and everywhere downstream. A standard \(3 \times 3\) over 64\(\to \)64 costs 36,864 weights; the separable version costs 576 (depthwise) + 4,096 (pointwise) = 4,672. Eight times fewer.

Inverted residual: depthwise at the wide point

MobileNetV2 wraps the separable block in one more idea. ResNet’s bottleneck block is wide \(\to \) narrow \(\to \) wide (project down for the expensive spatial conv, project back up for the residual). MobileNetV2 flips it: narrow \(\to \) wide \(\to \) narrow, with the depthwise spatial conv at the wide expansion. The intuition is that depthwise is so cheap that you can afford to make it wider; the narrow bottleneck channels carry the actual information across the residual.

Concretely, an inverted-residual block with input channels \(C_\text {in}\), output channels \(C_\text {out}\), and expansion ratio \(t\):

  1. \(1 \times 1\) conv \(C_\text {in} \to t \cdot C_\text {in}\) (the expand)

  2. \(3 \times 3\) depthwise at \(t \cdot C_\text {in}\) channels (the spatial mix at the wide point)

  3. \(1 \times 1\) conv \(t \cdot C_\text {in} \to C_\text {out}\) (the project, narrowing back down)

  4. Residual skip when \(C_\text {in} = C_\text {out}\) and stride equals 1

This is the .invertedResidual ic oc t stride n primitive in the spec—one Layer constructor that emits all three convs, the BN+ReLU between them, and the conditional residual add. The VJP for the whole block composes the depthwise VJP from this chapter with the standard conv2d VJPs from Chapter 4, the BN VJP from Chapter 5, and the additive fan-in for the residual from Chapter 6. Once again, no new math beyond the foundation rules—just the depthwise primitive plus composition.

Stacking the blocks

MobileNetV2 stacks 17 inverted-residual blocks across seven stages (at depths 1, 2, 3, 4, 3, 3, 1) with widths progressing from 16 to 320 channels, then a final \(1 \times 1\) to 1280 channels, global average pool, and a 1280-to-10 dense head—the same stem-stages-head template as ResNet-34. Total: 2.24M parameters, 9.5\(\times \) fewer than ResNet-34, at a modest accuracy cost (87% vs 90% on Imagenette in the example below). For mobile and embedded deployment that trade is the whole reason MobileNet exists.

The theorems

Definition 36 Depthwise conv forward
#

Concrete per-channel cross-correlation (Phase 7).

Theorem 37 Depthwise input VJP
#

Phase 2 (Apr 2026) proof using the same recipe as conv2d_has_vjp3 with one fewer \(\Sigma \) level (no cross-channel mixing in depthwise).

Proof

Mechanical; see Proofs.depthwise_has_vjp3.

Theorem 38 Depthwise weight VJP

Phase 7: reuses \(\mathsf{HasVJP3}\) directly since DepthwiseKernel is definitionally \(\mathsf{Tensor3}\). Derived from foundation rules.

Proof
Theorem 39 Depthwise bias VJP

Phase 9. Derived from foundation rules.

Proof

7.1 Example: MobileNet V2 on Imagenette

Depthwise convolution on its own is rarely used directly — you almost always see it wrapped in a depthwise-separable sandwich: \(1 \times 1\) expand, \(3 \times 3\) depthwise, \(1 \times 1\) project, with a residual connection around the whole thing. That sandwich is the inverted-residual block that defines MobileNet V2, and it’s how depthwise convolutions got their fame.

The point of depthwise is parameter efficiency. A normal \(3 \times 3\) conv over 64 channels uses \(3 \times 3 \times 64 \times 64 = 36864\) weights. A depthwise version uses \(3 \times 3 \times 64 = 576\) weights. 64\(\times \) fewer. The depthwise-separable block recovers the missing cross-channel expressivity with the surrounding \(1 \times 1\) convs, at a combined cost far below a full \(3 \times 3\). MobileNet V2 is what you get when you stack 17 of these blocks and call it a network.

The architecture

-- 1
import LeanMlir

-- 2
def mobilenetV2 : NetSpec where
  name   := "MobileNet-v2"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,                    -- stem 224→112
    .invertedResidual  32  16 1 1 1,           -- 112, expand 1×
    .invertedResidual  16  24 6 2 2,           -- 112→56, t=6
    .invertedResidual  24  32 6 2 3,           -- 56→28, t=6
    .invertedResidual  32  64 6 2 4,           -- 28→14, t=6
    .invertedResidual  64  96 6 1 3,           -- 14, t=6
    .invertedResidual  96 160 6 2 3,           -- 14→7, t=6
    .invertedResidual 160 320 6 1 1,           -- 7, t=6
    .convBn 320 1280 1 1 .same,                -- 1×1 to 1280
    .globalAvgPool,
    .dense 1280 10 .identity
  ]

-- 3
def mobilenetV2Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  mobilenetV2.train mobilenetV2Config (args.head?.getD "data/imagenette")

The .invertedResidual primitive takes (ic, oc, expandRatio, stride, nBlocks). Internally each block is: \(1 \times 1\) conv to ic*expandRatio channels, \(3 \times 3\) depthwise at that width, \(1 \times 1\) project to oc, plus a residual skip when ic == oc and stride == 1. The VJP is § 37 for the depthwise conv composed with the existing dense / BN / biPath theorems we already proved. Training config is identical to ResNet-34’s (Appendix A): Adam + cosine + warmup + wd + augmentation + label smoothing, the same production recipe.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/mobilenet-v2-train
MobileNet-v2: 2236682 params
Generating train step MLIR...
  741020 chars
Compiling vmfbs...
  forward compiled
  eval forward (fixed BN) compiled
  compiled
  session loaded
  train: 9469 images (256×256)
  2236682 params + m + v (25 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 52, BN stat floats: 34112
  step 0/295: loss=2.327938 (832ms)
Epoch  1/80: loss=1.968284 lr=0.000333 (243839ms)
Epoch  2/80: loss=(dropping) lr=0.000667 ...
...
Epoch 79/80: loss=0.532640 lr=0.000002 (244123ms)
Epoch 80/80: loss=0.532729 lr=0.000000 (244070ms)
  val accuracy (running BN): 3400/3904 = 87.09%

(Full log in logs/mnv2_train.log.)

Final val accuracy 87.09% — 3.2 points below ResNet-34’s 90.29% on the same dataset and training config, at 9.5\(\times \) fewer parameters (2.24M vs 21.29M). That’s the depthwise-separable trade: you give up a couple of accuracy points to get an order-of- magnitude reduction in parameter count. For mobile and embedded deployment, that’s a deal you want.

A few observations worth calling out:

- Per-step time is faster, not slower: 830 ms vs ResNet-34’s 1400 ms. Even though the network has more layers (17 inverted-residual blocks \(\times \) 3 sub-layers each = 51+ internal conv layers, vs ResNet-34’s 34), the depthwise convs are so cheap that overall throughput improves. Fewer parameters and faster training, at a modest accuracy cost.

- MLIR is actually bigger, not smaller: 741 020 chars vs ResNet-34’s 517 912. That’s counterintuitive — the network has fewer params but emits more StableHLO. The reason: each depthwise-separable block has three separate convs (expand, depthwise, project) plus their BN ops, so there are more operations even though each operation has fewer weights. IREE emits per-op, so the char count tracks op count, not param count.

- 52 BN layers, up from ResNet-34’s 36. Same reason — more internal conv layers means more BN-after-conv pairs. Each one still proves its VJP via § 34.

- 9.5\(\times \) fewer parameters, 5.4 hours total training vs ResNet-34’s 9.5 hours. The speedup comes entirely from the smaller per-step time; the epoch count is the same.

The general pattern — deeper + narrower + per-channel conv + surrounding \(1 \times 1\) expansion — carries forward to MobileNet V3 and the EfficientNet family. All of those add their own small modifications on top (SE blocks, Swish activations, compound scaling) but the core depthwise-separable scaffolding is exactly what this chapter formalized.