6 MobileNetV2

A standard conv does two jobs at once

A standard 3$\times $3 convolution from $C_\text {in}$ channels to $C_\text {out}$ channels computes, for every output cell:

\[ y_{c_\text {out},\, i,\, j} \; =\; \sum _{c_\text {in},\, k_h,\, k_w} W_{c_\text {out},\, c_\text {in},\, k_h,\, k_w} \cdot x_{c_\text {in},\, i+k_h,\, j+k_w}. \]

That sum is doing two things. The $k_h, k_w$ part is spatial mixing: it combines a pixel with its neighbors. The $c_\text {in}$ part is cross-channel mixing: it combines the $C_\text {in}$ channels into each of $C_\text {out}$ new channels. Both happen inside the same kernel and at the same cost $C_\text {in} \times C_\text {out} \times k^2$ weights per layer.

Howard et al. 2017 (MobileNetV1, then V2 in arXiv:1801.04381) asked: what if we factored those two responsibilities and only paid the full cost of one of them? The spatial mixing is locally information-rich and benefits from a real $3 \times 3$ neighborhood. The channel mixing can be approximated by a much cheaper operation. If we separate them, we get parameter efficiency without losing much expressive power.

Depthwise: one kernel per channel, no cross-channel sum

A depthwise convolution does the spatial half on its own, per channel. Each output channel comes from one input channel, with its own $k \times k$ kernel:

\[ y_{c,\, i,\, j} \; =\; \sum _{k_h,\, k_w} W_{c,\, k_h,\, k_w} \cdot x_{c,\, i+k_h,\, j+k_w}. \]

Only one summation level (over the kernel positions), not two. No mixing between channels at all—channel $c$ of the output only ever looks at channel $c$ of the input. The weight tensor is shape $[C, k, k]$ instead of $[C_\text {out}, C_\text {in}, k, k]$:

\[ \text{depthwise weights: } C \times 3 \times 3 \quad \text{vs}\quad \text{standard weights: } C_\text {out} \times C_\text {in} \times 3 \times 3. \]

For a typical $C_\text {in} = C_\text {out} = 64$ layer, that’s 576 weights vs 36,864 weights—a 64$\times $ reduction. The Jacobian from Chapter 3 simplifies in step: one fewer $\Sigma $ level means the depthwise input-VJP is a slightly simpler reversed-kernel convolution, the weight VJP is a slightly simpler transpose-trick, and the bias VJP is the same per-channel sum. Theorems 39 through 41 prove the depthwise VJPs.

The depthwise-separable sandwich

A depthwise conv alone never lets channels see each other. That’s a problem—the whole point of stacking convolutions is to build hierarchical features that combine information across channels. The fix is to follow each depthwise with a pointwise $1 \times 1$ convolution, which is purely cross-channel mixing (no spatial structure—each output pixel is a learned linear combination of the same input pixel’s channels). That two-step combination,

\[ \text{depthwise}\ 3 \times 3 \quad \longrightarrow \quad \text{pointwise}\ 1 \times 1, \]

is called a depthwise-separable convolution, and it’s what replaces the standard $3 \times 3$ conv in MobileNet and everywhere downstream. A standard $3 \times 3$ over 64$\to $64 costs 36,864 weights; the separable version costs 576 (depthwise) + 4,096 (pointwise) = 4,672. Eight times fewer.

Inverted residual: depthwise at the wide point

MobileNetV2 wraps the separable block in one more idea. ResNet’s bottleneck block is wide $\to $ narrow $\to $ wide (project down for the expensive spatial conv, project back up for the residual). MobileNetV2 flips it: narrow $\to $ wide $\to $ narrow, with the depthwise spatial conv at the wide expansion. The intuition is that depthwise is so cheap that you can afford to make it wider; the narrow bottleneck channels carry the actual information across the residual.

Concretely, an inverted-residual block with input channels $C_\text {in}$, output channels $C_\text {out}$, and expansion ratio $t$:

$1 \times 1$ conv $C_\text {in} \to t \cdot C_\text {in}$ (the expand)
$3 \times 3$ depthwise at $t \cdot C_\text {in}$ channels (the spatial mix at the wide point)
$1 \times 1$ conv $t \cdot C_\text {in} \to C_\text {out}$ (the project, narrowing back down)
Residual skip when $C_\text {in} = C_\text {out}$ and stride equals 1

This is the .invertedResidual ic oc t stride n primitive in the spec—one Layer constructor that emits all three convs, the BN+ReLU between them, and the conditional residual add. The VJP for the whole block composes the depthwise VJP from this chapter with the standard conv2d VJPs from Chapter 3, the BN VJP from Chapter 4, and the additive fan-in for the residual from Chapter 5. Once again, no new math beyond the foundation rules—just the depthwise primitive plus composition.

Stacking the blocks

MobileNetV2 stacks 17 inverted-residual blocks across seven stages (at depths 1, 2, 3, 4, 3, 3, 1) with widths progressing from 16 to 320 channels, then a final $1 \times 1$ to 1280 channels, global average pool, and a 1280-to-10 dense head—the same stem-stages-head template as ResNet-34. Total: 2.24M parameters, 9.5$\times $ fewer than ResNet-34, at a modest accuracy cost (87% vs 90% on Imagenette in the example below). For mobile and embedded deployment that trade is the whole reason MobileNet exists.

6.1 The theorems

Definition 38 Depthwise conv forward

✓

Concrete per-channel cross-correlation (Phase 7).

Theorem 39 Depthwise input VJP

✓

Phase 2 (Apr 2026). prove: $\mathsf{HasVJP3}\, \bigl(\mathrm{depthwiseConv2d}(W, b)\bigr)$, with the per-channel reversed-kernel backward $dx_{c, h, w} = \sum _{k_h, k_w} W_{c,\, k_H - 1 - k_h,\, k_W - 1 - k_w} \cdot dy_{c,\, h + k_h - p,\, w + k_w - p}$.

Proof ▶

Sketch: the conv2d input VJP (Theorem 25) with one fewer $\Sigma $ level — depthwise has no cross-channel mixing, so the channel of every $v$-read is forced to equal the output channel.

Define $B(x, dy)_{c, h_i, w_i} := \sum _{h_o, w_o} [\text{pad-valid}]\; W_{c, \hat{k}_h, \hat{k}_w} \cdot dy_{c, h_o, w_o}$ with reconstructed offsets $\hat{k}_h = h_i + p_H - h_o$, $\hat{k}_w = w_i + p_W - w_o$ — same shape as the conv2d formula, minus the channel sum.
suffices: $B$ matches the $\operatorname {pdiv}_3$ contraction at every index.
proof: Definition 24.
The flattened forward is affine in $v$: constant bias plus $\sum _{k_h, k_w} W_{o(\mathrm{idx}), k_h, k_w} \cdot (\text{pad-conditional projection of } v)$, where the projection reads channel $o(\mathrm{idx})$ itself.
proof: Unfold (Definition 38); steps and rules exactly as in Theorem 25 step 4 (Theorems 3, 6, 8 $\times 2$, 4, 7), with a double instead of triple sum.
Collapse: the channel component of the indicator forces $c_o = c_i$ (channel mismatch zeroes every other term); per $(h_o, w_o)$ the remaining two-conjunct indicator ($k_h + h_o = h_i + p_H$, $k_w + w_o = w_i + p_W$) collapses the $(k_h, k_w)$ sum.
proof: Finset.sum_eq_single on the decoded components.
q.e.d.
proof: What remains is $B(x, dy)_{c_i, h_i, w_i}$ (equivalent to the reversed-kernel form under $k_h \leftrightarrow k_H - 1 - k_h$); with 2, done.

Theorem 40 Depthwise weight VJP

✓

Phase 7. DepthwiseKernel is definitionally $\mathsf{Tensor3}$, so $\mathsf{HasVJP3}$ applies to the kernel directly — no flatten bijection needed, unlike the conv2d weight VJP (Theorem 26). prove: $\mathsf{HasVJP3}\, \bigl(W \mapsto \mathrm{depthwiseConv2d}(W, b)\, x\bigr)$, with the per-channel transpose-trick backward $dW_{c, k_h, k_w} = \sum _{h_o, w_o} [\text{pad-valid}]\; x_{c,\, k_h + h_o - p_H,\, k_w + w_o - p_W} \cdot dy_{c, h_o, w_o}$.

Proof ▶

Define $B(W, dy)_{c, k_h, k_w}$ as the formula above.
suffices: $B$ matches the $\operatorname {pdiv}_3$ contraction at every kernel index.
proof: Definition 24, applicable since the kernel is a $\mathsf{Tensor3}$.
Per-$(c_o, h_o, w_o)$ Jacobian: $\operatorname {pdiv}_3 = [c_o = c]\, \cdot (\text{pad-conditional } x \text{ term})$ — a single channel equality where the conv2d weight VJP had a packed $\varphi $-comparison.
proof: At output $(c_o, h_o, w_o)$ the forward is $b_{c_o} + \sum _{k_h, k_w} W_{c_o, k_h, k_w} \cdot x\text{-pad-term}$: affine decomposition and the same foundation-rule sequence as Theorem 26 step 4 (sum + constant + finite-sum $\times 2$ + product + reindex rules).
q.e.d.
proof: Contract 3 with $dy$: collapse the channel sum at $c_o = c$ (Finset.sum_eq_single); the spatial sums survive, leaving exactly $B$; with 2, done.

Theorem 41 Depthwise bias VJP

✓

Phase 9. prove: $\mathsf{HasVJP}\, \bigl(b \mapsto \mathrm{flatten} (\mathrm{depthwiseConv2d}(W, b)\, x)\bigr)$, with backward $db_c = \sum _{h_i, w_i} dy_{\mathrm{flat}(c, h_i, w_i)}$.

Proof ▶

Same three steps as the conv2d bias VJP (Theorem 27), with input channel $=$ output channel throughout:

As a function of $b$, the flattened forward is $(\text{channel reindex of } b) + (\text{$W\! ,x$ term constant in } b)$, so $\operatorname {pdiv}= \delta _{c,\, \mathrm{chan}(\mathrm{idx})}$.
proof: Sum rule (Theorem 3), reindex Jacobian (Theorem 7), constant rule (Theorem 6).
suffices: $db_c = \sum _{\mathrm{idx}} \operatorname {pdiv}\cdot dy_{\mathrm{idx}}$.
proof: Definition 9.
q.e.d.
proof: Contract 1 with $dy$; the Kronecker keeps channel $c$’s spatial positions (Finset.sum_eq_single).

6.2 Example: MobileNet V2 on Imagenette

Depthwise convolution on its own is rarely used directly — you almost always see it wrapped in a depthwise-separable sandwich: $1 \times 1$ expand, $3 \times 3$ depthwise, $1 \times 1$ project, with a residual connection around the whole thing. That sandwich is the inverted-residual block that defines MobileNet V2, and it’s how depthwise convolutions got their fame.

The point of depthwise is parameter efficiency. A normal $3 \times 3$ conv over 64 channels uses $3 \times 3 \times 64 \times 64 = 36864$ weights. A depthwise version uses $3 \times 3 \times 64 = 576$ weights. 64$\times $ fewer. The depthwise-separable block recovers the missing cross-channel expressivity with the surrounding $1 \times 1$ convs, at a combined cost far below a full $3 \times 3$. MobileNet V2 is 17 of these blocks together.

The architecture

The same vertical column as ResNet-34, now built from inverted-residual blocks: a $3 \times 3$ stride-2 stem, seven stages of .invertedResidual blocks (17 in all, widths $16$ to $320$, expansion $t{=}6$ except the first), a $1 \times 1$ lift to 1280 channels, global average pool, and a dense head — and no max-pools, since every downsample is folded into a strided block. The inset shows one inverted-residual block: the cheap depthwise spatial mix happens at the wide point.

$\begin{tikzpicture} [ >={Stealth[length=1.8mm]}, every node/.style={font=\sffamily\scriptsize}, col/.style = {align=center, rounded corners=2pt, inner sep=2pt, minimum height=0.56cm, minimum width=6.3cm}, io/.style = {col, draw=blue!55!black, fill=blue!8}, convbn/.style = {col, draw=orange!65!black, fill=orange!12}, invres/.style = {col, draw=teal!65!black, fill=teal!10, minimum height=0.6cm}, gap/.style = {col, draw=purple!60!black, fill=purple!8}, head/.style = {col, draw=red!60!black, fill=red!8}, logits/.style = {col, draw=green!50!black, fill=green!14, very thick}, arr/.style = {->, thick, gray!60, shorten >=1pt, shorten <=1pt}, stage/.style = {font=\sffamily\scriptsize\itshape, gray!55!black, anchor=west}, ] \node[io] (input) {Input \;\; $224\times224\times3$}; \node[convbn, below=0.16cm of input] (stem) {\textbf{ConvBN} stem $3\to32$, $3\times3$, /2 \;\; $112{\times}112$}; \node[invres, below=0.16cm of stem] (s1) {$1\times$ \textbf{InvRes} $32\to16$, $t{=}1$ \;\; $112{\times}112$}; \node[invres, below=0.16cm of s1] (s2) {$2\times$ \textbf{InvRes} $16\to24$, $t{=}6$, /2 \;\; $56{\times}56$}; \node[invres, below=0.16cm of s2] (s3) {$3\times$ \textbf{InvRes} $24\to32$, $t{=}6$, /2 \;\; $28{\times}28$}; \node[invres, below=0.16cm of s3] (s4) {$4\times$ \textbf{InvRes} $32\to64$, $t{=}6$, /2 \;\; $14{\times}14$}; \node[invres, below=0.16cm of s4] (s5) {$3\times$ \textbf{InvRes} $64\to96$, $t{=}6$ \;\; $14{\times}14$}; \node[invres, below=0.16cm of s5] (s6) {$3\times$ \textbf{InvRes} $96\to160$, $t{=}6$, /2 \;\; $7{\times}7$}; \node[invres, below=0.16cm of s6] (s7) {$1\times$ \textbf{InvRes} $160\to320$, $t{=}6$ \;\; $7{\times}7$}; \node[convbn, below=0.16cm of s7] (pw) {\textbf{ConvBN} $1\times1$ $320\to1280$ \;\; $7{\times}7$}; \node[gap, below=0.16cm of pw] (g) {global avg pool \;\; $7\times7\times1280 \to 1280$}; \node[head, below=0.16cm of g] (d) {\textbf{Dense} $1280\to10$ \;(identity)}; \node[logits, below=0.16cm of d] (out) {Logits \;\; 10 classes, softmax-CE}; \foreach \a/\b in {input/stem, stem/s1, s1/s2, s2/s3, s3/s4, s4/s5, s5/s6, s6/s7, s7/pw, pw/g, g/d, d/out} \draw[arr] (\a) -- (\b); \foreach \n/\k in {s1/1, s2/2, s3/3, s4/4, s5/5, s6/6, s7/7} \node[stage] at ($(\n.east) + (0.30,0)$) {stage \k}; \end{tikzpicture}$

$\begin{tikzpicture} [ >={Stealth[length=1.6mm]}, every node/.style={font=\sffamily\scriptsize}, box/.style = {align=center, rounded corners=2pt, inner sep=3pt, minimum height=0.62cm, draw=teal!70!black, fill=teal!10}, dot/.style = {circle, draw=teal!70!black, fill=teal!14, inner sep=0pt, minimum size=0.42cm}, term/.style = {font=\sffamily\scriptsize\itshape, inner sep=1pt}, arr/.style = {->, thick, gray!65, shorten >=1pt, shorten <=1pt}, skip/.style = {->, thick, teal!70!black, shorten >=1pt}, ] \node[term] (x) {$x$}; \node[box, right=0.5cm of x] (e) {$1\times1$ expand\\$C\!\to\!tC$, BN, ReLU6}; \node[box, right=0.4cm of e] (dw) {$3\times3$ DW\\BN, ReLU6}; \node[box, right=0.4cm of dw] (p) {$1\times1$ project\\BN (linear)}; \node[dot, right=0.5cm of p] (sum) {$+$}; \node[term, right=0.5cm of sum] (y) {$y$}; \draw[arr] (x) -- (e); \draw[arr] (e) -- (dw); \draw[arr] (dw) -- (p); \draw[arr] (p) -- (sum); \draw[arr] (sum) -- (y); \draw[skip] (x.north) .. controls +(0,0.85) and +(0,0.85) .. (sum.north) node[midway, above, term, teal!70!black] {skip if stride 1, $C_{\text{in}}{=}C_{\text{out}}$}; \node[term, below=0.16cm of dw, gray!55!black] {one inverted-residual block: depthwise spatial mix at the wide point}; \end{tikzpicture}$

-- 1
import LeanMlir

-- 2
def mobilenetV2 : NetSpec where
  name   := "MobileNet-v2"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,                    -- stem 224→112
    .invertedResidual  32  16 1 1 1,           -- 112, expand 1×
    .invertedResidual  16  24 6 2 2,           -- 112→56, t=6
    .invertedResidual  24  32 6 2 3,           -- 56→28, t=6
    .invertedResidual  32  64 6 2 4,           -- 28→14, t=6
    .invertedResidual  64  96 6 1 3,           -- 14, t=6
    .invertedResidual  96 160 6 2 3,           -- 14→7, t=6
    .invertedResidual 160 320 6 1 1,           -- 7, t=6
    .convBn 320 1280 1 1 .same,                -- 1×1 to 1280
    .globalAvgPool,
    .dense 1280 10 .identity
  ]

-- 3
def mobilenetV2Config : TrainConfig where
  learningRate   := 0.001
  batchSize      := 32
  epochs         := 80
  useAdam        := true
  weightDecay    := 0.0001
  cosineDecay    := true
  warmupEpochs   := 3
  augment        := true
  labelSmoothing := 0.1

-- 4
def main (args : List String) : IO Unit :=
  mobilenetV2.train mobilenetV2Config (args.head?.getD "data/imagenette")

The .invertedResidual primitive takes (ic, oc, expandRatio, stride, nBlocks). Internally each block is: $1 \times 1$ conv to ic*expandRatio channels, $3 \times 3$ depthwise at that width, $1 \times 1$ project to oc, plus a residual skip when ic == oc and stride == 1. The VJP is § 39 for the depthwise conv composed with the existing dense / BN / biPath theorems we already proved. Training config is identical to ResNet-34’s (Appendix B): Adam + cosine + warmup + wd + augmentation + label smoothing, the same production recipe.

Results

$ IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
    ./.lake/build/bin/mobilenet-v2-train
MobileNet-v2: 2236682 params
Generating train step MLIR...
  741020 chars
Compiling vmfbs...
  forward compiled
  eval forward (fixed BN) compiled
  compiled
  session loaded
  train: 9469 images (256×256)
  2236682 params + m + v (25 MB)
training: 295 batches/epoch, batch=32, Adam, lr=0.001000,
          cosine, label_smooth=0.1, wd=1e-4
  BN layers: 52, BN stat floats: 34112
  step 0/295: loss=2.327938 (832ms)
Epoch  1/80: loss=1.968284 lr=0.000333 (243839ms)
Epoch  2/80: loss=(dropping) lr=0.000667 ...
...
Epoch 79/80: loss=0.532640 lr=0.000002 (244123ms)
Epoch 80/80: loss=0.532729 lr=0.000000 (244070ms)
  val accuracy (running BN): 3400/3904 = 87.09%

(Full log in logs/mnv2_train.log.)

Final val accuracy 87.09% — 3.2 points below ResNet-34’s 90.29% on the same dataset and training config, at 9.5$\times $ fewer parameters (2.24M vs 21.29M). That’s the depthwise-separable trade: you give up a couple of accuracy points to get an order-of- magnitude reduction in parameter count.

- Per-step time is faster, not slower: 830 ms vs ResNet-34’s 1400 ms. Even though the network has more layers (17 inverted-residual blocks $\times $ 3 sub-layers each = 51+ internal conv layers, vs ResNet-34’s 34), the depthwise convs are so cheap that overall throughput improves. Fewer parameters and faster training, at a modest accuracy cost.

- MLIR is actually bigger, not smaller: 741 020 chars vs ResNet-34’s 517 912. That’s counterintuitive — the network has fewer params but emits more StableHLO. The reason: each depthwise-separable block has three separate convs (expand, depthwise, project) plus their BN ops, so there are more operations even though each operation has fewer weights. IREE emits per-op, so the char count tracks op count, not param count.

- 52 BN layers, up from ResNet-34’s 36. Same reason — more internal conv layers means more BN-after-conv pairs. Each one still proves its VJP via § 36.

- 9.5$\times $ fewer parameters, 5.4 hours total training vs ResNet-34’s 9.5 hours. The speedup comes entirely from the smaller per-step time; the epoch count is the same.

6.3 MLIR: Depthwise Convolution

What is already proven. A depthwise convolution applies one $k\times k$ kernel per channel with no cross-channel sum, so its reverse-mode derivative (§ 39, depthwise_has_vjp3) is a reversed-kernel convolution minus the channel transpose the full-conv VJP of Chapter 3 needs: with no $\sum _{c_\text {in}}$ to take the adjoint of, there is no in/out channel axis to swap. The inverted-residual block composes this with the pointwise convs, BatchNorm, and relu6; mobilenetv2_has_vjp_at chains the whole network through vjp_comp_at.

The gap and how we close it. The emitted graph is denoted and shown equal to depthwise_has_vjp3’s backward. Here is what the printer emits for the depthwise input gradient (two channels, $3\times 3$):

func.func @dw_back(%dy: tensor<1x2x4x4xf32>, %W: tensor<2x3x3xf32>)
    -> tensor<1x2x4x4xf32> {
  %We = stablehlo.reshape %W
          : (tensor<2x3x3xf32>) -> tensor<2x1x3x3xf32>
  %Wr = stablehlo.reverse %We, dims = [2, 3] : tensor<2x1x3x3xf32>
  %dx = stablehlo.convolution(%dy, %Wr)
      dim_numbers = [b, f, 0, 1]x[o, i, 0, 1]->[b, f, 0, 1],
      window = {stride = [1, 1], pad = [[1, 1], [1, 1]],
                lhs_dilate = [1, 1], rhs_dilate = [1, 1]}
      {batch_group_count = 1 : i64, feature_group_count = 2 : i64}
      : (tensor<1x2x4x4xf32>, tensor<2x1x3x3xf32>)
        -> tensor<1x2x4x4xf32>
  return %dx : tensor<1x2x4x4xf32>
}

Read it against Chapter 3’s conv_back: that one began with a transpose %W, dims = [1, 0, 2, 3] to swap the in- and out-channel axes before reversing; this one has no transpose. The reshape only adds the singleton input-channel axis the grouped form expects ($[c,k,k] \to [c,1,k,k]$), the reverse flips the two spatial axes, and feature_group_count = 2 — equal to the channel count — tells the convolution to run each channel through its own filter, no mixing. That single attribute is the depthwise structure; the bridge theorem is the claim that this graph computes depthwise_has_vjp3’s backward.

Caveats.

The depthwise convolution bridge is unconditional — it is linear.
relu6 is a smooth-point bridge — MobileNetV2’s clamp to $[0,6]$ has no derivative where a pre-activation sits exactly at $0$ or $6$, and the bridge may fail only on that measure-zero set.
Representative scale (two channels at $4\times 4$).

6.4 ImageNet recipe

Why the phase-2 trainer? The ImageNet runs in this book use the phase-2 (Lean$\to $JAX) trainer: at this scale its job is to validate the framework’s logic end to end. Whether the phase-3 verified-IREE codegen can reach ImageNet under its own codegen rules is an open question — phase-2 is how we establish these baselines in the meantime.

The Imagenette run above trained on $\sim $9.5K images. The same backbone scales to full 1000-class ImageNet exactly the way ResNet-34 did in Chapter 5: the inverted-residual stack is unchanged, only the head widens to 1000 and the dataset and schedule grow. The spec mirrors jax/MainMobilenetV2Imagenet.lean:

-- Same inverted-residual backbone, 1000 classes instead of 10.
def mobilenetV2Imagenet : NetSpec where
  name   := "MobileNetV2 (ImageNet, bf16)"
  imageH := 224
  imageW := 224
  layers := [
    .convBn 3 32 3 2 .same,
    .invertedResidual  32  16 1 1 1,
    .invertedResidual  16  24 6 2 2,
    .invertedResidual  24  32 6 2 3,
    .invertedResidual  32  64 6 2 4,
    .invertedResidual  64  96 6 1 3,
    .invertedResidual  96 160 6 2 3,
    .invertedResidual 160 320 6 1 1,
    .convBn 320 1280 1 1 .same,
    .globalAvgPool,
    .dense 1280 1000 .identity      -- 1000-class head
  ]

-- RMSProp recipe, MobileNet-native: smaller weight decay, bf16 conv on.
def mobilenetV2ImagenetConfig : TrainConfig where
  learningRate   := 0.045           -- MobileNetV2-native RMSProp peak (was 0.1 for SGD)
  batchSize      := 256
  epochs         := 90              -- the real run; validate at 30 first
  optimizer      := .rmsprop        -- paper: RMSProp + momentum, not SGD
  momentum       := 0.9             -- mu for the RMSProp momentum buffer
  rmspropDecay   := 0.9             -- rho, the running mean-square decay
  rmspropEps     := 1.0             -- MobileNetV2 uses epsilon = 1.0
  weightDecay    := 4e-5            -- smaller than R34's 1e-4
  cosineDecay    := true
  warmupEpochs   := 5
  augment        := true            -- random-crop + horizontal flip
  useAutoAugment := false           -- MNv2 paper used crop/flip only, no AA
  labelSmoothing := 0.1
  bf16           := true
  bf16Conv       := true            -- reaches the inverted-residual blocks

The one knob specific to MobileNet is bf16Conv. ResNet’s convolutions cast to bfloat16 the instant the flag is on, but MobileNet’s compute lives inside inverted-residual blocks — a $1\times 1$ expansion, a depthwise $3\times 3$, a $1\times 1$ projection — whose convolutions originally ran in fp32 regardless. Routing them through the same convdt cast as the plain convolutions is what lets bf16 reach the bulk of the network: the payoff is the whole inverted-residual block running $\sim 2\times $ faster on cuDNN, where the $1\times 1$s (which are really matmuls) love bfloat16 and the depthwise $3\times 3$ is a wash. The weight decay also drops to $4\times 10^{-5}$ — MobileNet’s depthwise filters are tiny ($9$ weights per channel), and the $10^{-4}$ that suits ResNet over-regularizes them.

Compute budget. On six RTX 4060 Ti (CUDA, bf16, batch $252$ across the six GPUs), steady-state throughput is $\sim $89 ms per step, about $7.5$ minutes per epoch — so the full 90-epoch schedule completed in roughly 11 wall-clock hours. Even at this throughput the wall-clock win over the $9.5\times $-larger ResNet-34 is modest: depthwise convolution is memory-bound and low-arithmetic-intensity, so MobileNet’s order-of-magnitude FLOP saving does not translate into a proportional wall-clock advantage on this hardware.

GPU	Precision	Per epoch	90 epochs	Val top-1	Val top-5
6$\times $ 4060 Ti (CUDA)	bf16	$\sim $7.5 min	$\sim $11 hr	$\mathbf{68.77\% }$	$\mathbf{88.53\% }$

The full 90-epoch run — now on MobileNetV2’s native RMSProp ($\rho =0.9$, $\mu =0.9$, $\varepsilon =1.0$) with crop/flip augmentation, the paper recipe — completed at $\mathbf{68.77\% }$ top-1 / $\mathbf{88.53\% }$ top-5 on the full 50,000-image validation split (EMA weights). That is only $+0.44$ over the earlier SGD run at the same schedule ($68.33\% $ / $88.17\% $): swapping in the paper’s optimizer barely moved the needle, so the remaining few points to the $\sim $71% paper number are the schedule — the paper trains far longer with RMSProp’s exponential-decay LR and a heavier augmentation stack — not the optimizer, precision, or pipeline. The RMSProp peak LR of $0.045$ took off cleanly with no collapse: the large $\varepsilon =1.0$ keeps the adaptive step well-damped (in sharp contrast to EfficientNet’s $\varepsilon =10^{-3}$ in Chapter 7, which is far more LR-sensitive), so the $\sim $0.02 fallback was never needed. The validation curve has the usual shape — a fast warmup climb, a long middle, and a final lift as the cosine schedule anneals to zero:

$\begin{tikzpicture} \begin{axis}[ width=0.92\linewidth, height=6.5cm, xlabel={Epoch}, ylabel={Validation accuracy (\%)}, xmin=0, xmax=91, ymin=0, ymax=95, xtick={0,15,30,45,60,75,90}, ytick={20,40,60,80}, legend pos=south east, legend cell align={left}, grid=major, grid style={gray!18}, tick label style={font=\small}, label style={font=\small}, every axis plot/.append style={line width=1pt, mark size=1pt}, ] \addplot[blue, mark=*, mark options={fill=blue}] coordinates { (5,34.43) (6,39.02) (7,42.76) (8,45.35) (13,52.59) (14,53.58) (15,53.83) (16,54.63) (21,56.73) (22,57.33) (23,57.39) (24,57.80) (29,58.78) (30,59.21) (31,59.71) (32,60.04) (37,60.77) (38,61.47) (39,61.20) (40,61.67) (45,61.99) (46,62.73) (47,62.80) (48,62.88) (53,63.73) (54,64.23) (55,64.22) (56,64.65) (61,65.26) (62,65.51) (63,65.44) (64,65.87) (69,66.64) (70,66.76) (71,67.16) (72,67.37) (75,67.60) (84,68.61) (85,68.75) (86,68.73) (87,68.84) (88,68.86) (89,68.84) (90,68.83) }; \addlegendentry{top-1} \addplot[orange, mark=*, mark options={fill=orange}] coordinates { (5,60.54) (6,65.37) (7,68.80) (8,71.29) (13,77.44) (14,78.10) (15,78.59) (16,78.85) (21,80.74) (22,80.87) (23,81.23) (24,81.70) (29,82.27) (30,82.46) (31,82.71) (32,83.10) (37,83.59) (38,83.99) (39,83.70) (40,84.19) (45,84.49) (46,84.83) (47,84.94) (48,84.98) (53,85.58) (54,85.64) (55,85.64) (56,86.07) (61,86.45) (62,86.52) (63,86.65) (64,86.85) (69,87.27) (70,87.47) (71,87.56) (72,87.67) (75,87.84) (84,88.51) (85,88.53) (86,88.62) (87,88.60) (88,88.64) (89,88.63) (90,88.63) }; \addlegendentry{top-5} \end{axis} \end{tikzpicture}$

MobileNetV2 / ImageNet-1k validation accuracy per epoch (bf16, 6$\times $ 4060 Ti, RMSProp).

[TODO — run w/ config.] mobilenet-v2-imagenet full tier (recipe arg full), 350 epochs. Changed since the $68.77\% $ above: label smoothing $0.1\to 0.0$ (the MobileNetV2 paper used none; the number above predates the fix), and the schedule goes $90\to 350$. The remaining $\sim $3 points to the paper’s $72.0\% $ are schedule, not recipe: MobileNetV2 was trained with RMSProp under an exponential LR decay ($\sim $0.98 per epoch) for several hundred epochs, where the 90-epoch tier here is a deliberately short validation tier. The same holds across the family — EfficientNet-B0 ($\sim $77%) and ConvNeXt-T ($\sim $82%) both assume $\sim $300–350-epoch schedules where their heavier augmentation (AutoAugment; Mixup/CutMix/RandErase) finally pays off rather than under-training. A dedicated paper-faithful pass — each net at its full schedule on the six-GPU box ($\sim $35 hr apiece) — is left as future work to close these gaps.

The general pattern — deeper + narrower + per-channel conv + surrounding $1 \times 1$ expansion — carries forward to MobileNet V3 and the EfficientNet family. All of those add their own small modifications on top (SE blocks, Swish activations, compound scaling) but the core depthwise-separable scaffolding is exactly what this chapter formalized.