- Boxes
- definitions
- Ellipses
- theorems and lemmas
- Blue border
- the statement of this result is ready to be formalized; all prerequisites are done
- Orange border
- the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background
- the proof of this result is ready to be formalized; all prerequisites are done
- Green border
- the statement of this result is formalized
- Green background
- the proof of this result is formalized
- Dark green background
- the proof of this result and all its ancestors are formalized
- Dark green border
- this is in Mathlib
\(\operatorname {bnIstdBroadcast}\) is \(\mathsf{Differentiable}\) when \(\varepsilon {\gt} 0\). Proved via Differentiable.sqrt and Differentiable.inv over \(\mathrm{bnVar} + \varepsilon {\gt} 0\); captures the sqrt/recip smoothness of \(1/\sqrt{\sigma ^2 + \varepsilon }\) required by pdiv_mul inside pdiv_bnNormalize.
The partial derivative function. For \(f : \mathbb {R}^{m} \to \mathbb {R}^{n}\), \(\operatorname {pdiv}\, f\, x\, i\, j\) is the \((i, j)\)-entry of the Jacobian at \(x\), defined as \(\operatorname {fderiv}_{\mathbb {R}}\, f\, x\, (\mathbf{e}_i)\, j\) — the \(j\)-th coordinate of Mathlib’s Fréchet derivative applied to the \(i\)-th standard basis vector.
\(\partial \mathrm{istd}/\partial x_i = -\mathrm{istd}^3 \cdot (x_i - \mu )/n\). Proved via the centering \(\mathrm{ContinuousLinearMap}\), HasFDerivAt.sqrt (under \(\mathrm{bnVar} + \varepsilon {\gt} 0\)), and (hasDerivAt_inv).comp_hasFDerivAt; centered sum collapses by \(\sum _k (x_k - \mu ) = 0\).
Pdiv of \(\mathrm{ReLU}\) at smooth points (\(\forall k,\ x_k \neq 0\)). Proved via local-diagonal-CLM transport: at a smooth point ReLU agrees with the diagonal indicator CLM \(\Pi _k\, (\text{if } x_k {\gt} 0 \text{ then } \mathrm{proj}_k \text{ else } 0)\) on \(\mathrm{Metric.ball}\, x\, (\min |x_k|)\) (every coordinate keeps its sign), and HasFDerivAt.congr_of_eventuallyEq transports the CLM’s self-fderiv to ReLU.
\(\partial \mathrm{softmax}(z)_j / \partial z_i = p_j (\delta _{ij} - p_i)\). Proved by extracting the j-th coordinate function \(z' \mapsto e^{z'_j} \cdot (\sum _k e^{z'_k})^{-1}\) and chaining HasFDerivAt.exp with (hasDerivAt_inv).comp_hasFDerivAt. At \(\mathbf{e}_i\) the sum collapses via \(\sum _k e^{z_k} \cdot \delta _{ki} = e^{z_i}\).
Applying a vector function per-row to a matrix has block-diagonal Jacobian. Proved from \(\operatorname {fderiv}\) via the row-projection \(\mathrm{ContinuousLinearMap}\) and the chain rule, given a \(\mathsf{Differentiable}\) hypothesis on the per-row function.
\(\mathrm{rowSoftmax}\) (the flattened form) is \(\mathsf{Differentiable}\). Proved via Mathlib’s \(\mathrm{Real.exp}\) calculus: the denominator \(\sum _j \exp (M_{rj})\) is positive (by Finset.sum_pos), giving a Differentiable.inv application that finishes the chain.
\(\partial L / \partial z = \mathrm{softmax}(z) - \mathrm{onehot}(y)\). Proved by chain rule on \(-\log \circ \mathrm{softmax\_ label}\) using HasFDerivAt.log (with \(\mathrm{softmax}(z)[\, \ell \, ] {\gt} 0\) from exp positivity) composed with the proved \(\mathrm{softmax}\) Jacobian.
DeepLab v3+’s marquee module (Chen et al. 2018). Five parallel branches emitting oc channels each: (1) 1\(\times \)1 conv, (2–4) 3\(\times \)3 atrous convs at dilation rates 6 / 12 / 18, (5) global avg-pool + 1\(\times \)1 conv + bilinear upsample. Concatenate, then a 1\(\times \)1 fusion conv back to oc. All branches include BN + ReLU. Atrous rates widen the effective receptive field without changing param count; the pool branch supplies image-level context. Signature: asppModule (ic oc : Nat).
CSP (Wang et al. 2019), used by YOLOv4 onward. Splits input into two halves, processes one half through a stack of residual blocks, then concatenates with the untouched half and 1\(\times \)1-projects to oc. The specific inner block varies across YOLO versions (C3 in v5, C2f in v8, C3k2 in v11); this primitive approximates all three at the same abstraction level. Signature: cspBlock (ic oc nBlocks : Nat).
YOLOv3’s Darknet-53 residual stack (Redmon & Farhadi 2018). nBlocks residual blocks at fixed channels, each being 1\(\times \)1 conv (\(c \to c/2\)) \(+\) 3\(\times \)3 conv (\(c/2 \to c\)) \(+\) residual add. Lighter than a standard ResNet bottleneck; heavier than a ResNet-18 basic block. Signature: darknetBlock (channels nBlocks : Nat).
DenseNet’s bundled dense block (Huang et al. 2017). nLayers BN-ReLU-1\(\times \)1-BN-ReLU-3\(\times \)3 sub-stacks, each adding growthRate channels to the running concatenation of preceding outputs. Input has ic channels; output has ic + nLayers \(\cdot \) growthRate. Bundled because each sub-layer reads a growing concat of all preceding sub-layers, which doesn’t fit a linear NetSpec at sub-layer granularity. Signature: denseBlock (ic growthRate nLayers : Nat).
Dual-representation (MSA + pair) joint update: MSA row-attention with pair bias + MSA column-attention + MSA transition + outer-product-mean (\(\to \) pair) + triangle multiplicative (outgoing, incoming) + triangle self-attention (starting node, ending node) + pair transition. Signature: evoformerBlock (msaChannels pairChannels nBlocks : Nat). The triangulation-aware operations are the key inductive bias.
Lin et al. 2017. Takes the four stage outputs of a CNN backbone (channels c2/c3/c4/c5 at strides \(4/8/16/32\)), projects each to target channels with a \(1\times 1\) lateral conv, merges them top-down (upsample-2\(\times \) then elementwise add), and applies a \(3\times 3\) smoothing conv at each merged level. Output: four feature maps, each target-wide, at the original spatial resolutions. Bundled because the cross-scale add doesn’t fit a linear NetSpec. Standard kit in 2-stage detectors (Mask R-CNN, Cascade R-CNN) and single-stage detectors (RetinaNet). Signature: fpnModule (c2 c3 c4 c5 target : Nat).
The GoogLeNet parallel-branch module (Szegedy et al. 2014). Four branches computed in parallel: 1\(\times \)1 conv (b1 channels); 1\(\times \)1 reduce then 3\(\times \)3 (b2); 1\(\times \)1 reduce then 5\(\times \)5 (b3); 3\(\times \)3 maxPool then 1\(\times \)1 (b4). Concat along channels for b1 + b2 + b3 + b4 outputs. The 1\(\times \)1 dimension reducers on branches 2 and 3 are the paper’s trick — they make the expensive 3\(\times \)3 and 5\(\times \)5 convs operate on reduced channel counts. Signature: inceptionModule (ic b1 b2reduce b2 b3reduce b3 b4 : Nat).
Selective state-space block in the S6 formulation (Gu & Dao 2023). Signature: mambaBlock (dim stateSize expand nBlocks : Nat). Bundles RMSNorm + linear expand + depthwise 1D conv + SiLU + selective-scan SSM + gated product + output projection, for nBlocks stacked layers. The selective scan is the novel primitive; everything else could be decomposed to existing layers if we cared to unpack the bundle.
Hybrid local-conv + patch-level transformer (Mehta & Rastegari 2022). Local 3\(\times \)3 conv \(\to \) 1\(\times \)1 projection to transformer dim \(\to \) unfold into patches \(\to \) \(L\) transformer blocks across patches \(\to \) fold back \(\to \) 1\(\times \)1 projection back \(\to \) concat with input \(\to \) 3\(\times \)3 fusion. Signature: mobileVitBlock (ic dim heads mlpDim nTxBlocks : Nat). The unfold/fold operations are pdiv_reindex-style shape transformations; all the genuinely new math is already covered by the transformer proof chapter.
The whole NeRF network (Mildenhall et al. 2020) bundled as one primitive: 8 hidden ReLU-FC layers of hiddenDim, mid-skip concatenating \(\gamma (x)\) at layer 5, dual output heads (1-dim volume density \(\sigma \) + 3-dim RGB via a direction-conditioned branch). Under 600K parameters at the canonical config. Signature: nerfMLP (encodedPosDim encodedDirDim hiddenDim : Nat).
Sinusoidal frequency basis (Vaswani 2017, reused by NeRF 2020): \(\gamma (p) = (\sin 2^0 \pi p, \cos 2^0 \pi p, \ldots , \sin 2^{L-1} \pi p, \cos 2^{L-1} \pi p)\). Zero trainable parameters — it’s a deterministic lift of a low-dim coordinate into a high-frequency feature space where an MLP has enough wiggle room to represent sharp details. Output dim \(=\) inputDim \(\cdot 2 \cdot \) numFrequencies. Signature: positionalEncoding (inputDim numFrequencies : Nat).
Grouped 1\(\times \)1 conv + channel-shuffle permutation + 3\(\times \)3 depthwise + grouped 1\(\times \)1 conv, for nUnits units (first downsampling, rest residual). The shuffle is parameter-free; grouping reduces 1\(\times \)1 cost by \(g\). Signature: shuffleBlock (ic oc groups nUnits : Nat).
nUnits v2 units (Ma et al. 2018). Basic unit (stride 1): channel-split \([X_1, X_2]\), leave \(X_1\) alone, run \(X_2\) through \(1\times 1 \to 3\times 3\) DW \(\to 1\times 1\) at half-width, concat, then channel-shuffle. Downsample unit (stride 2): both branches see the full input; left does DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1, right does 1\(\times \)1 \(+\) DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1; concat doubles channels. v2 throws out v1’s grouped 1\(\times \)1 convs (G2) and skip-add (G4) per the paper’s practical guidelines. Signature: shuffleV2Block (ic oc nUnits : Nat).
Recurrent Invariant Point Attention + backbone frame update + side-chain \(\chi \)-angle prediction. Weights shared across nBlocks rounds — param count does not multiply by nBlocks. Signature: structureModule (singleChannels pairChannels nBlocks : Nat).
nBlocks blocks of self-attention over nQueries learned object queries + cross-attention against an encoder output + FFN. The query embedding is part of the layer’s parameters. Signature: transformerDecoder (dim heads mlpDim nBlocks nQueries : Nat).
Inter-block downsample for DenseNet. BN + 1\(\times \)1 conv (ic \(\to \) oc) + 2\(\times \)2 average-pool stride 2. Halves spatial resolution and (typically) halves channel count via the 1\(\times \)1 conv to compress feature reuse between dense blocks. Signature: transitionLayer (ic oc : Nat).
One stack of nLayers dilated causal residual blocks with doubling dilation rates \(2^0, 2^1, \ldots , 2^{\texttt{nLayers}-1}\) (van den Oord et al. 2016). Each block: dilated 2-tap causal conv \(\to \) gated activation \(\tanh (\text{filter}) \odot \sigma (\text{gate})\) \(\to \) 1\(\times \)1 project back to residualCh (residual path) + 1\(\times \)1 skip projection to skipCh. Skip outputs across blocks are summed into the final head. Signature: waveNetBlock (residualCh skipCh nLayers : Nat). Output channels are skipCh: bestiary convention picks the skip path as the "forward" output since it’s what feeds the final classifier.
noncomputable def over the canonical pdiv-derived witness; HasVJP.correct holds by rfl since \(\operatorname {pdiv}\) is a def over \(\operatorname {fderiv}\). At non-smooth points the canonical backward is \(\operatorname {fderiv}\)’s junk default of \(0\); the codegen substitutes the standard subgradient convention.