7 MobileNetV2
A standard conv does two jobs at once
A standard 3\(\times \)3 convolution from \(C_\text {in}\) channels to \(C_\text {out}\) channels computes, for every output cell:
That sum is doing two things. The \(k_h, k_w\) part is spatial mixing: it combines a pixel with its neighbors. The \(c_\text {in}\) part is cross-channel mixing: it combines the \(C_\text {in}\) channels into each of \(C_\text {out}\) new channels. Both happen inside the same kernel and at the same cost \(C_\text {in} \times C_\text {out} \times k^2\) weights per layer.
Howard et al. 2017 (MobileNetV1, then V2 in arXiv:1801.04381) asked: what if we factored those two responsibilities and only paid the full cost of one of them? The spatial mixing is locally information-rich and benefits from a real \(3 \times 3\) neighborhood. The channel mixing can be approximated by a much cheaper operation. If we separate them, we get parameter efficiency without losing much expressive power.
Depthwise: one kernel per channel, no cross-channel sum
A depthwise convolution does the spatial half on its own, per channel. Each output channel comes from one input channel, with its own \(k \times k\) kernel:
Only one summation level (over the kernel positions), not two. No mixing between channels at all—channel \(c\) of the output only ever looks at channel \(c\) of the input. The weight tensor is shape \([C, k, k]\) instead of \([C_\text {out}, C_\text {in}, k, k]\):
For a typical \(C_\text {in} = C_\text {out} = 64\) layer, that’s 576 weights vs 36,864 weights—a 64\(\times \) reduction. The Jacobian from Chapter 4 simplifies in step: one fewer \(\Sigma \) level means the depthwise input-VJP is a slightly simpler reversed-kernel convolution, the weight VJP is a slightly simpler transpose-trick, and the bias VJP is the same per-channel sum. Theorems 37 through 39 prove the depthwise VJPs.
The depthwise-separable sandwich
A depthwise conv alone never lets channels see each other. That’s a problem—the whole point of stacking convolutions is to build hierarchical features that combine information across channels. The fix is to follow each depthwise with a pointwise \(1 \times 1\) convolution, which is purely cross-channel mixing (no spatial structure—each output pixel is a learned linear combination of the same input pixel’s channels). That two-step combination,
is called a depthwise-separable convolution, and it’s what replaces the standard \(3 \times 3\) conv in MobileNet and everywhere downstream. A standard \(3 \times 3\) over 64\(\to \)64 costs 36,864 weights; the separable version costs 576 (depthwise) + 4,096 (pointwise) = 4,672. Eight times fewer.
Inverted residual: depthwise at the wide point
MobileNetV2 wraps the separable block in one more idea. ResNet’s bottleneck block is wide \(\to \) narrow \(\to \) wide (project down for the expensive spatial conv, project back up for the residual). MobileNetV2 flips it: narrow \(\to \) wide \(\to \) narrow, with the depthwise spatial conv at the wide expansion. The intuition is that depthwise is so cheap that you can afford to make it wider; the narrow bottleneck channels carry the actual information across the residual.
Concretely, an inverted-residual block with input channels \(C_\text {in}\), output channels \(C_\text {out}\), and expansion ratio \(t\):
\(1 \times 1\) conv \(C_\text {in} \to t \cdot C_\text {in}\) (the expand)
\(3 \times 3\) depthwise at \(t \cdot C_\text {in}\) channels (the spatial mix at the wide point)
\(1 \times 1\) conv \(t \cdot C_\text {in} \to C_\text {out}\) (the project, narrowing back down)
Residual skip when \(C_\text {in} = C_\text {out}\) and stride equals 1
This is the .invertedResidual ic oc t stride n primitive in the spec—one Layer constructor that emits all three convs, the BN+ReLU between them, and the conditional residual add. The VJP for the whole block composes the depthwise VJP from this chapter with the standard conv2d VJPs from Chapter 4, the BN VJP from Chapter 5, and the additive fan-in for the residual from Chapter 6. Once again, no new math beyond the foundation rules—just the depthwise primitive plus composition.
Stacking the blocks
MobileNetV2 stacks 17 inverted-residual blocks across seven stages (at depths 1, 2, 3, 4, 3, 3, 1) with widths progressing from 16 to 320 channels, then a final \(1 \times 1\) to 1280 channels, global average pool, and a 1280-to-10 dense head—the same stem-stages-head template as ResNet-34. Total: 2.24M parameters, 9.5\(\times \) fewer than ResNet-34, at a modest accuracy cost (87% vs 90% on Imagenette in the example below). For mobile and embedded deployment that trade is the whole reason MobileNet exists.
The theorems
Concrete per-channel cross-correlation (Phase 7).
Phase 2 (Apr 2026) proof using the same recipe as conv2d_has_vjp3 with one fewer \(\Sigma \) level (no cross-channel mixing in depthwise).
Mechanical; see Proofs.depthwise_has_vjp3.
Phase 7: reuses \(\mathsf{HasVJP3}\) directly since DepthwiseKernel is definitionally \(\mathsf{Tensor3}\). Derived from foundation rules.
Mechanical; see Proofs.depthwise_weight_grad_has_vjp3.
Phase 9. Derived from foundation rules.
Mechanical; see Proofs.depthwise_bias_grad_has_vjp.
7.1 Example: MobileNet V2 on Imagenette
Depthwise convolution on its own is rarely used directly — you almost always see it wrapped in a depthwise-separable sandwich: \(1 \times 1\) expand, \(3 \times 3\) depthwise, \(1 \times 1\) project, with a residual connection around the whole thing. That sandwich is the inverted-residual block that defines MobileNet V2, and it’s how depthwise convolutions got their fame.
The point of depthwise is parameter efficiency. A normal \(3 \times 3\) conv over 64 channels uses \(3 \times 3 \times 64 \times 64 = 36864\) weights. A depthwise version uses \(3 \times 3 \times 64 = 576\) weights. 64\(\times \) fewer. The depthwise-separable block recovers the missing cross-channel expressivity with the surrounding \(1 \times 1\) convs, at a combined cost far below a full \(3 \times 3\). MobileNet V2 is what you get when you stack 17 of these blocks and call it a network.
The architecture
-- 1
import LeanMlir
-- 2
def mobilenetV2 : NetSpec where
name := "MobileNet-v2"
imageH := 224
imageW := 224
layers := [
.convBn 3 32 3 2 .same, -- stem 224→112
.invertedResidual 32 16 1 1 1, -- 112, expand 1×
.invertedResidual 16 24 6 2 2, -- 112→56, t=6
.invertedResidual 24 32 6 2 3, -- 56→28, t=6
.invertedResidual 32 64 6 2 4, -- 28→14, t=6
.invertedResidual 64 96 6 1 3, -- 14, t=6
.invertedResidual 96 160 6 2 3, -- 14→7, t=6
.invertedResidual 160 320 6 1 1, -- 7, t=6
.convBn 320 1280 1 1 .same, -- 1×1 to 1280
.globalAvgPool,
.dense 1280 10 .identity
]
-- 3
def mobilenetV2Config : TrainConfig where
learningRate := 0.001
batchSize := 32
epochs := 80
useAdam := true
weightDecay := 0.0001
cosineDecay := true
warmupEpochs := 3
augment := true
labelSmoothing := 0.1
-- 4
def main (args : List String) : IO Unit :=
mobilenetV2.train mobilenetV2Config (args.head?.getD "data/imagenette")
The .invertedResidual primitive takes (ic, oc, expandRatio, stride, nBlocks). Internally each block is: \(1 \times 1\) conv to ic*expandRatio channels, \(3 \times 3\) depthwise at that width, \(1 \times 1\) project to oc, plus a residual skip when ic == oc and stride == 1. The VJP is § 37 for the depthwise conv composed with the existing dense / BN / biPath theorems we already proved. Training config is identical to ResNet-34’s (Appendix A): Adam + cosine + warmup + wd + augmentation + label smoothing, the same production recipe.
Results
$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
./.lake/build/bin/mobilenet-v2-train
MobileNet-v2: 2236682 params
Generating train step MLIR...
741020 chars
Compiling vmfbs...
forward compiled
eval forward (fixed BN) compiled
compiled
session loaded
train: 9469 images (256×256)
2236682 params + m + v (25 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
cosine, label_smooth=0.1, wd=1e-4
BN layers: 52, BN stat floats: 34112
step 0/295: loss=2.327938 (832ms)
Epoch 1/80: loss=1.968284 lr=0.000333 (243839ms)
Epoch 2/80: loss=(dropping) lr=0.000667 ...
...
Epoch 79/80: loss=0.532640 lr=0.000002 (244123ms)
Epoch 80/80: loss=0.532729 lr=0.000000 (244070ms)
val accuracy (running BN): 3400/3904 = 87.09%
(Full log in logs/mnv2_train.log.)
Final val accuracy 87.09% — 3.2 points below ResNet-34’s 90.29% on the same dataset and training config, at 9.5\(\times \) fewer parameters (2.24M vs 21.29M). That’s the depthwise-separable trade: you give up a couple of accuracy points to get an order-of- magnitude reduction in parameter count. For mobile and embedded deployment, that’s a deal you want.
A few observations worth calling out:
- Per-step time is faster, not slower: 830 ms vs ResNet-34’s 1400 ms. Even though the network has more layers (17 inverted-residual blocks \(\times \) 3 sub-layers each = 51+ internal conv layers, vs ResNet-34’s 34), the depthwise convs are so cheap that overall throughput improves. Fewer parameters and faster training, at a modest accuracy cost.
- MLIR is actually bigger, not smaller: 741 020 chars vs ResNet-34’s 517 912. That’s counterintuitive — the network has fewer params but emits more StableHLO. The reason: each depthwise-separable block has three separate convs (expand, depthwise, project) plus their BN ops, so there are more operations even though each operation has fewer weights. IREE emits per-op, so the char count tracks op count, not param count.
- 52 BN layers, up from ResNet-34’s 36. Same reason — more internal conv layers means more BN-after-conv pairs. Each one still proves its VJP via § 34.
- 9.5\(\times \) fewer parameters, 5.4 hours total training vs ResNet-34’s 9.5 hours. The speedup comes entirely from the smaller per-step time; the epoch count is the same.
The general pattern — deeper + narrower + per-channel conv + surrounding \(1 \times 1\) expansion — carries forward to MobileNet V3 and the EfficientNet family. All of those add their own small modifications on top (SE blocks, Swish activations, compound scaling) but the core depthwise-separable scaffolding is exactly what this chapter formalized.