- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
- Dark green border: this is in Mathlib
DeepLab v3+’s marquee module (Chen et al. 2018). Five parallel branches emitting oc channels each: (1) 1\(\times \)1 conv, (2–4) 3\(\times \)3 atrous convs at dilation rates 6 / 12 / 18, (5) global avg-pool + 1\(\times \)1 conv + bilinear upsample. Concatenate, then a 1\(\times \)1 fusion conv back to oc. All branches include BN + ReLU. Atrous rates widen the effective receptive field without changing param count; the pool branch supplies image-level context. Signature: asppModule (ic oc : Nat).
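The rate arithmetic is easy to sanity-check. A minimal Python sketch (helper names are ours, not part of the bestiary): an atrous k-by-k conv at rate r covers k + (k - 1)(r - 1) input positions, and the fusion conv sees five branches of oc channels each.

```python
def effective_kernel(k: int, rate: int) -> int:
    """Effective spatial extent of a k x k conv with dilation `rate`:
    the taps span k + (k - 1) * (rate - 1) input positions."""
    return k + (k - 1) * (rate - 1)

def aspp_concat_channels(oc: int, branches: int = 5) -> int:
    """Channels entering the 1x1 fusion conv: every branch emits oc."""
    return branches * oc

# The three atrous branches at rates 6 / 12 / 18 cover 13-, 25- and
# 37-pixel windows at the parameter cost of a plain 3x3 conv.
print([effective_kernel(3, r) for r in (1, 6, 12, 18)])  # [3, 13, 25, 37]
print(aspp_concat_channels(256))  # 1280
```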
ConvNeXt residual block \(\times \) nBlocks: 7\(\times \)7 depthwise conv \(\to \) LayerNorm \(\to \) 1\(\times \)1 expand (to \(4c\)) \(\to \) GELU \(\to \) 1\(\times \)1 project (back to \(c\)) + residual. The transformer-era CNN block (Liu et al. 2022). Does not include downsampling; see convNextDownsample. Signature: convNextStage (channels nBlocks : Nat).
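The block's cost can be tallied directly. A sketch of the per-block parameter count, under simplifying assumptions that are ours (biases on every conv/linear, LayerScale omitted):

```python
def convnext_block_params(c: int) -> int:
    """Parameters of one ConvNeXt block at width c, assuming biases on
    every conv/linear and no LayerScale (illustrative accounting)."""
    dw = 7 * 7 * c + c           # depthwise 7x7: one 49-tap filter per channel
    ln = 2 * c                   # LayerNorm scale + shift
    expand = c * (4 * c) + 4 * c # 1x1 expand to 4c
    project = (4 * c) * c + c    # 1x1 project back to c
    return dw + ln + expand + project
```

Under these assumptions the total is 8c^2 + 57c; the 1\(\times \)1 expand/project pair dominates, and the 7\(\times \)7 depthwise conv stays cheap because it has only one filter per channel.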
CSP (Wang et al. 2019), used by YOLOv4 onward. Splits input into two halves, processes one half through a stack of residual blocks, then concatenates with the untouched half and 1\(\times \)1-projects to oc. The specific inner block varies across YOLO versions (C3 in v5, C2f in v8, C3k2 in v11); this primitive approximates all three at the same abstraction level. Signature: cspBlock (ic oc nBlocks : Nat).
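The split/process/concat dataflow can be traced on channel labels instead of tensors. A toy sketch (labels and helper name are illustrative only; the inner residual stack is width-preserving, so only the split and concat change widths):

```python
def csp_block_flow(channels: list, n_blocks: int) -> list:
    """Trace a CSP block over channel labels: split in half, run one
    half through n_blocks width-preserving residual blocks, concat.
    A final 1x1 conv (not modeled here) then projects to oc."""
    half = len(channels) // 2
    keep, work = channels[:half], channels[half:]
    for i in range(n_blocks):
        work = [f"res{i}({c})" for c in work]  # residual stack on one half
    return keep + work                          # untouched half rejoins

print(csp_block_flow(["a", "b", "c", "d"], 1))
```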
YOLOv3’s Darknet-53 residual stack (Redmon & Farhadi 2018). nBlocks residual blocks at fixed channels, each being 1\(\times \)1 conv (\(c \to c/2\)) \(+\) 3\(\times \)3 conv (\(c/2 \to c\)) \(+\) residual add. Lighter than a standard ResNet bottleneck; heavier than a ResNet-18 basic block. Signature: darknetBlock (channels nBlocks : Nat).
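A rough per-block weight count, ignoring biases and BN parameters (a simplifying assumption of ours):

```python
def darknet_block_params(c: int) -> int:
    """Weights in one Darknet-53 residual block at width c (biases and
    BN omitted): 1x1 conv c -> c/2, then 3x3 conv c/2 -> c."""
    squeeze = 1 * 1 * c * (c // 2)   # 1x1 bottleneck
    expand = 3 * 3 * (c // 2) * c    # 3x3 back to full width
    return squeeze + expand

print(darknet_block_params(64))  # 20480
```

For even c this works out to 5c^2 per block: the 1\(\times \)1 squeeze keeps the 3\(\times \)3 conv at half width.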
The AlphaFold 2 Evoformer block (Jumper et al. 2021): a dual-representation (MSA + pair) joint update: MSA row-attention with pair bias + MSA column-attention + MSA transition + outer-product-mean (\(\to \) pair) + triangle multiplicative updates (outgoing, incoming) + triangle self-attention (starting node, ending node) + pair transition. Signature: evoformerBlock (msaChannels pairChannels nBlocks : Nat). The triangle-update operations are the key inductive bias.
Lin et al. 2017. Takes the four stage outputs of a CNN backbone (channels c2/c3/c4/c5 at strides \(4/8/16/32\)), projects each to target channels with a \(1\times 1\) lateral conv, merges them top-down (upsample-2\(\times \) then elementwise add), and applies a \(3\times 3\) smoothing conv at each merged level. Output: four feature maps, each target-wide, at the original spatial resolutions. Bundled because the cross-scale add doesn’t fit a linear NetSpec. Standard kit in 2-stage detectors (Mask R-CNN, Cascade R-CNN) and single-stage detectors (RetinaNet). Signature: fpnModule (c2 c3 c4 c5 target : Nat).
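The output geometry is fixed by the backbone strides. A sketch (function name is ours) of the four output shapes for an h \(\times \) w input, assuming h and w are divisible by 32:

```python
def fpn_output_shapes(h: int, w: int, target: int):
    """(channels, height, width) of the four FPN outputs: channel widths
    are unified to `target`; strides 4/8/16/32 come from the backbone."""
    return [(target, h // s, w // s) for s in (4, 8, 16, 32)]

print(fpn_output_shapes(224, 224, 256))
```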
The GoogLeNet parallel-branch module (Szegedy et al. 2014). Four branches computed in parallel: 1\(\times \)1 conv (b1 channels); 1\(\times \)1 reduce then 3\(\times \)3 (b2); 1\(\times \)1 reduce then 5\(\times \)5 (b3); 3\(\times \)3 maxPool then 1\(\times \)1 (b4). Concat along channels, giving b1 + b2 + b3 + b4 output channels. The 1\(\times \)1 dimension reducers on branches 2 and 3 are the paper’s trick — they make the expensive 3\(\times \)3 and 5\(\times \)5 convs operate on reduced channel counts. Signature: inceptionModule (ic b1 b2reduce b2 b3reduce b3 b4 : Nat).
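Only the four branch outputs survive the concat; the reduce widths stay internal. As a concrete check, the paper's inception (3a) configuration (ic = 192, branch widths 64 / 96\(\to \)128 / 16\(\to \)32 / pool\(\to \)32) emits 256 channels:

```python
def inception_out_channels(b1: int, b2: int, b3: int, b4: int) -> int:
    """Channels after the concat: the four branch outputs only;
    b2reduce and b3reduce never appear in the output width."""
    return b1 + b2 + b3 + b4

# GoogLeNet inception (3a): 64 + 128 + 32 + 32
print(inception_out_channels(64, 128, 32, 32))  # 256
```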
Selective state-space block in the S6 formulation (Gu & Dao 2023). Signature: mambaBlock (dim stateSize expand nBlocks : Nat). Bundles RMSNorm + linear expand + depthwise 1D conv + SiLU + selective-scan SSM + gated product + output projection, for nBlocks stacked layers. The selective scan is the novel primitive; everything else could be decomposed to existing layers if we cared to unpack the bundle.
Hybrid local-conv + patch-level transformer (Mehta & Rastegari 2022). Local 3\(\times \)3 conv \(\to \) 1\(\times \)1 projection to transformer dim \(\to \) unfold into patches \(\to \) \(L\) transformer blocks across patches \(\to \) fold back \(\to \) 1\(\times \)1 projection back \(\to \) concat with input \(\to \) 3\(\times \)3 fusion. Signature: mobileVitBlock (ic dim heads mlpDim nTxBlocks : Nat). The unfold/fold operations are pdiv_reindex-style shape transformations; all the genuinely new math is already covered by the transformer proof chapter.
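The unfold step is pure reindexing. A shape sketch (helper name is ours; p = 2 is the paper's default patch size): each of the P = p\(^2\) pixel positions attends over the N = hw/p\(^2\) patches.

```python
def mobilevit_unfold_shape(h: int, w: int, d: int, p: int = 2):
    """Shape after unfolding an h x w x d map into p x p patches
    (assumes p divides h and w): P pixel positions, each carrying a
    length-N sequence of dim-d tokens for the transformer blocks."""
    assert h % p == 0 and w % p == 0
    P = p * p            # pixels per patch
    N = (h * w) // P     # number of patches
    return (P, N, d)

print(mobilevit_unfold_shape(32, 32, 96, 2))  # (4, 256, 96)
```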
The whole NeRF network (Mildenhall et al. 2020) bundled as one primitive: 8 hidden ReLU-FC layers of hiddenDim, mid-skip concatenating \(\gamma (x)\) at layer 5, dual output heads (1-dim volume density \(\sigma \) + 3-dim RGB via a direction-conditioned branch). Under 600K parameters at the canonical config. Signature: nerfMLP (encodedPosDim encodedDirDim hiddenDim : Nat).
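The trunk's layer widths follow mechanically from the skip. A sketch (helper and 1-indexing are ours) of the eight hidden layers' (in, out) dims:

```python
def nerf_trunk_dims(pos_dim: int, hidden: int):
    """(in, out) dims of the 8 hidden FC layers: layer 1 reads gamma(x),
    layer 5 reads the skip concat of gamma(x) with the hidden state,
    the rest are hidden -> hidden."""
    dims = []
    for layer in range(1, 9):
        if layer == 1:
            d_in = pos_dim
        elif layer == 5:
            d_in = hidden + pos_dim  # mid-skip: gamma(x) re-concatenated
        else:
            d_in = hidden
        dims.append((d_in, hidden))
    return dims
```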
Sinusoidal frequency basis (Vaswani et al. 2017, reused by NeRF in 2020): \(\gamma (p) = (\sin 2^0 \pi p, \cos 2^0 \pi p, \ldots , \sin 2^{L-1} \pi p, \cos 2^{L-1} \pi p)\). Zero trainable parameters — it’s a deterministic lift of a low-dim coordinate into a high-frequency feature space where an MLP has enough wiggle room to represent sharp details. Output dim \(=\) inputDim \(\cdot 2 \cdot \) numFrequencies. Signature: positionalEncoding (inputDim numFrequencies : Nat).
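The lift itself is a few lines. A minimal pure-Python sketch (the per-component sin/cos ordering varies across implementations; this one groups by component):

```python
import math

def positional_encoding(p, num_frequencies: int):
    """gamma(p) for a coordinate vector p: sin(2^k pi p_i), cos(2^k pi p_i)
    for each component p_i and k = 0 .. L-1. No trainable parameters;
    output length is len(p) * 2 * num_frequencies."""
    out = []
    for x in p:
        for k in range(num_frequencies):
            out.append(math.sin(2 ** k * math.pi * x))
            out.append(math.cos(2 ** k * math.pi * x))
    return out

# A 3-dim coordinate with L = 10 lifts to 60 dims.
print(len(positional_encoding([0.1, 0.2, 0.3], 10)))  # 60
```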
The ShuffleNet v1 unit stack (Zhang et al. 2017): grouped 1\(\times \)1 conv + channel-shuffle permutation + 3\(\times \)3 depthwise + grouped 1\(\times \)1 conv, for nUnits units (first downsampling, rest residual). The shuffle is parameter-free; grouping cuts the 1\(\times \)1 cost by a factor of the group count. Signature: shuffleBlock (ic oc groups nUnits : Nat).
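The shuffle is just a transpose of the channel index. A pure-Python sketch on channel labels (helper name is ours):

```python
def channel_shuffle(channels: list, groups: int) -> list:
    """Parameter-free channel shuffle: view the channel list as a
    (groups x n) grid, transpose, flatten. This mixes information
    across the groups that grouped 1x1 convs keep separate."""
    n = len(channels) // groups
    assert n * groups == len(channels)
    grid = [channels[g * n:(g + 1) * n] for g in range(groups)]
    return [grid[g][i] for i in range(n) for g in range(groups)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], 2))  # [0, 3, 1, 4, 2, 5]
```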
nUnits v2 units (Ma et al. 2018). Basic unit (stride 1): channel-split \([X_1, X_2]\), leave \(X_1\) alone, run \(X_2\) through \(1\times 1 \to 3\times 3\) DW \(\to 1\times 1\) at half-width, concat, then channel-shuffle. Downsample unit (stride 2): both branches see the full input; left does DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1, right does 1\(\times \)1 \(+\) DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1; concat doubles channels. v2 throws out v1’s grouped 1\(\times \)1 convs (G2) and skip-add (G4) per the paper’s practical guidelines. Signature: shuffleV2Block (ic oc nUnits : Nat).
Recurrent Invariant Point Attention + backbone frame update + side-chain \(\chi \)-angle prediction (the AlphaFold 2 structure module; Jumper et al. 2021). Weights shared across nBlocks rounds — param count does not multiply by nBlocks. Signature: structureModule (singleChannels pairChannels nBlocks : Nat).
nBlocks blocks of self-attention over nQueries learned object queries + cross-attention against an encoder output + FFN — the DETR-style decoder (Carion et al. 2020). The query embedding is part of the layer’s parameters. Signature: transformerDecoder (dim heads mlpDim nBlocks nQueries : Nat).
One stack of nLayers dilated causal residual blocks with doubling dilation rates \(2^0, 2^1, \ldots , 2^{\texttt{nLayers}-1}\) (van den Oord et al. 2016). Each block: dilated 2-tap causal conv \(\to \) gated activation \(\tanh (\text{filter}) \odot \sigma (\text{gate})\) \(\to \) 1\(\times \)1 project back to residualCh (residual path) + 1\(\times \)1 skip projection to skipCh. Skip outputs across blocks are summed into the final head. Signature: waveNetBlock (residualCh skipCh nLayers : Nat). Output channels are skipCh: bestiary convention picks the skip path as the "forward" output since it’s what feeds the final classifier.
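The doubling schedule gives exponential receptive-field growth for linear depth. A sketch (the `stacks` parameter, for repeating the schedule as the full WaveNet does, is our addition; the bestiary primitive is a single stack):

```python
def wavenet_receptive_field(n_layers: int, stacks: int = 1) -> int:
    """Receptive field (in samples) of `stacks` repeats of the doubling
    dilation schedule with 2-tap causal convs: each layer at dilation d
    adds d samples, so one stack reaches 1 + (2^n - 1) = 2^n."""
    per_stack = sum(2 ** i for i in range(n_layers))
    return 1 + stacks * per_stack

print(wavenet_receptive_field(10))     # one stack of 10 layers: 1024
print(wavenet_receptive_field(10, 3))  # three repeated stacks: 3070
```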