- Boxes: definitions
- Ellipses: theorems and lemmas
- Blue border: the statement of this result is ready to be formalized; all prerequisites are done
- Orange border: the statement of this result is not ready to be formalized; the blueprint needs more work
- Blue background: the proof of this result is ready to be formalized; all prerequisites are done
- Green border: the statement of this result is formalized
- Green background: the proof of this result is formalized
- Dark green background: the proof of this result and all its ancestors are formalized
- Dark green border: this is in Mathlib
DeepLab v3+’s marquee module (Chen et al. 2018). Five parallel branches emitting oc channels each: (1) 1\(\times \)1 conv, (2–4) 3\(\times \)3 atrous convs at dilation rates 6 / 12 / 18, (5) global avg-pool + 1\(\times \)1 conv + bilinear upsample. Concatenate, then a 1\(\times \)1 fusion conv back to oc. All branches include BN + ReLU. Atrous rates widen the effective receptive field without changing param count; the pool branch supplies image-level context. Signature: asppModule (ic oc : Nat).
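The rate arithmetic is easy to sanity-check. A minimal Python sketch (helper names are ours, not part of the bestiary): an atrous k-by-k conv at rate r covers k + (k - 1)(r - 1) input positions, and the fusion conv sees five branches of oc channels each.

```python
def effective_kernel(k: int, rate: int) -> int:
    """Effective spatial extent of a k x k conv with dilation `rate`:
    the taps span k + (k - 1) * (rate - 1) input positions."""
    return k + (k - 1) * (rate - 1)

def aspp_concat_channels(oc: int, branches: int = 5) -> int:
    """Channels entering the 1x1 fusion conv: every branch emits oc."""
    return branches * oc

# The three atrous branches at rates 6 / 12 / 18 cover 13-, 25- and
# 37-pixel windows at the parameter cost of a plain 3x3 conv.
print([effective_kernel(3, r) for r in (1, 6, 12, 18)])  # [3, 13, 25, 37]
print(aspp_concat_channels(256))  # 1280
```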
ConvNeXt residual block \(\times \) nBlocks: 7\(\times \)7 depthwise conv \(\to \) LayerNorm \(\to \) 1\(\times \)1 expand (to \(4c\)) \(\to \) GELU \(\to \) 1\(\times \)1 project (back to \(c\)) + residual. The transformer-era CNN block (Liu et al. 2022). Does not include downsampling; see convNextDownsample. Signature: convNextStage (channels nBlocks : Nat).
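The block's cost can be tallied directly. A sketch of the per-block parameter count, under simplifying assumptions that are ours (biases on every conv/linear, LayerScale omitted):

```python
def convnext_block_params(c: int) -> int:
    """Parameters of one ConvNeXt block at width c, assuming biases on
    every conv/linear and no LayerScale (illustrative accounting)."""
    dw = 7 * 7 * c + c           # depthwise 7x7: one 49-tap filter per channel
    ln = 2 * c                   # LayerNorm scale + shift
    expand = c * (4 * c) + 4 * c # 1x1 expand to 4c
    project = (4 * c) * c + c    # 1x1 project back to c
    return dw + ln + expand + project
```

Under these assumptions the total is 8c^2 + 57c; the 1\(\times \)1 expand/project pair dominates, and the 7\(\times \)7 depthwise conv stays cheap because it has only one filter per channel.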
CSP (Wang et al. 2019), used by YOLOv4 onward. Splits input into two halves, processes one half through a stack of residual blocks, then concatenates with the untouched half and 1\(\times \)1-projects to oc. The specific inner block varies across YOLO versions (C3 in v5, C2f in v8, C3k2 in v11); this primitive approximates all three at the same abstraction level. Signature: cspBlock (ic oc nBlocks : Nat).
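The split/process/concat dataflow can be traced on channel labels instead of tensors. A toy sketch (labels and helper name are illustrative only; the inner residual stack is width-preserving, so only the split and concat change widths):

```python
def csp_block_flow(channels: list, n_blocks: int) -> list:
    """Trace a CSP block over channel labels: split in half, run one
    half through n_blocks width-preserving residual blocks, concat.
    A final 1x1 conv (not modeled here) then projects to oc."""
    half = len(channels) // 2
    keep, work = channels[:half], channels[half:]
    for i in range(n_blocks):
        work = [f"res{i}({c})" for c in work]  # residual stack on one half
    return keep + work                          # untouched half rejoins

print(csp_block_flow(["a", "b", "c", "d"], 1))
```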
YOLOv3’s Darknet-53 residual stack (Redmon & Farhadi 2018). nBlocks residual blocks at fixed channels, each being 1\(\times \)1 conv (\(c \to c/2\)) \(+\) 3\(\times \)3 conv (\(c/2 \to c\)) \(+\) residual add. Lighter than a standard ResNet bottleneck; heavier than a ResNet-18 basic block. Signature: darknetBlock (channels nBlocks : Nat).
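A rough per-block weight count, ignoring biases and BN parameters (a simplifying assumption of ours):

```python
def darknet_block_params(c: int) -> int:
    """Weights in one Darknet-53 residual block at width c (biases and
    BN omitted): 1x1 conv c -> c/2, then 3x3 conv c/2 -> c."""
    squeeze = 1 * 1 * c * (c // 2)   # 1x1 bottleneck
    expand = 3 * 3 * (c // 2) * c    # 3x3 back to full width
    return squeeze + expand

print(darknet_block_params(64))  # 20480
```

For even c this works out to 5c^2 per block: the 1\(\times \)1 squeeze keeps the 3\(\times \)3 conv at half width.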
The AlphaFold 2 Evoformer block (Jumper et al. 2021): a dual-representation (MSA + pair) joint update: MSA row-attention with pair bias + MSA column-attention + MSA transition + outer-product-mean (\(\to \) pair) + triangle multiplicative updates (outgoing, incoming) + triangle self-attention (starting node, ending node) + pair transition. Signature: evoformerBlock (msaChannels pairChannels nBlocks : Nat). The triangle-update operations are the key inductive bias.
Lin et al. 2017. Takes the four stage outputs of a CNN backbone (channels c2/c3/c4/c5 at strides \(4/8/16/32\)), projects each to target channels with a \(1\times 1\) lateral conv, merges them top-down (upsample-2\(\times \) then elementwise add), and applies a \(3\times 3\) smoothing conv at each merged level. Output: four feature maps, each target-wide, at the original spatial resolutions. Bundled because the cross-scale add doesn’t fit a linear NetSpec. Standard kit in 2-stage detectors (Mask R-CNN, Cascade R-CNN) and single-stage detectors (RetinaNet). Signature: fpnModule (c2 c3 c4 c5 target : Nat).
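The output geometry is fixed by the backbone strides. A sketch (function name is ours) of the four output shapes for an h \(\times \) w input, assuming h and w are divisible by 32:

```python
def fpn_output_shapes(h: int, w: int, target: int):
    """(channels, height, width) of the four FPN outputs: channel widths
    are unified to `target`; strides 4/8/16/32 come from the backbone."""
    return [(target, h // s, w // s) for s in (4, 8, 16, 32)]

print(fpn_output_shapes(224, 224, 256))
```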
The GoogLeNet parallel-branch module (Szegedy et al. 2014). Four branches computed in parallel: 1\(\times \)1 conv (b1 channels); 1\(\times \)1 reduce then 3\(\times \)3 (b2); 1\(\times \)1 reduce then 5\(\times \)5 (b3); 3\(\times \)3 maxPool then 1\(\times \)1 (b4). Concat along channels, giving b1 + b2 + b3 + b4 output channels. The 1\(\times \)1 dimension reducers on branches 2 and 3 are the paper’s trick — they make the expensive 3\(\times \)3 and 5\(\times \)5 convs operate on reduced channel counts. Signature: inceptionModule (ic b1 b2reduce b2 b3reduce b3 b4 : Nat).
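Only the four branch outputs survive the concat; the reduce widths stay internal. As a concrete check, the paper's inception (3a) configuration (ic = 192, branch widths 64 / 96\(\to \)128 / 16\(\to \)32 / pool\(\to \)32) emits 256 channels:

```python
def inception_out_channels(b1: int, b2: int, b3: int, b4: int) -> int:
    """Channels after the concat: the four branch outputs only;
    b2reduce and b3reduce never appear in the output width."""
    return b1 + b2 + b3 + b4

# GoogLeNet inception (3a): 64 + 128 + 32 + 32
print(inception_out_channels(64, 128, 32, 32))  # 256
```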
Selective state-space block in the S6 formulation (Gu & Dao 2023). Signature: mambaBlock (dim stateSize expand nBlocks : Nat). Bundles RMSNorm + linear expand + depthwise 1D conv + SiLU + selective-scan SSM + gated product + output projection, for nBlocks stacked layers. The selective scan is the novel primitive; everything else could be decomposed to existing layers if we cared to unpack the bundle.
Hybrid local-conv + patch-level transformer (Mehta & Rastegari 2022). Local 3\(\times \)3 conv \(\to \) 1\(\times \)1 projection to transformer dim \(\to \) unfold into patches \(\to \) \(L\) transformer blocks across patches \(\to \) fold back \(\to \) 1\(\times \)1 projection back \(\to \) concat with input \(\to \) 3\(\times \)3 fusion. Signature: mobileVitBlock (ic dim heads mlpDim nTxBlocks : Nat). The unfold/fold operations are pdiv_reindex-style shape transformations; all the genuinely new math is already covered by the transformer proof chapter.
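The unfold step is pure reindexing. A shape sketch (helper name is ours; p = 2 is the paper's default patch size): each of the P = p\(^2\) pixel positions attends over the N = hw/p\(^2\) patches.

```python
def mobilevit_unfold_shape(h: int, w: int, d: int, p: int = 2):
    """Shape after unfolding an h x w x d map into p x p patches
    (assumes p divides h and w): P pixel positions, each carrying a
    length-N sequence of dim-d tokens for the transformer blocks."""
    assert h % p == 0 and w % p == 0
    P = p * p            # pixels per patch
    N = (h * w) // P     # number of patches
    return (P, N, d)

print(mobilevit_unfold_shape(32, 32, 96, 2))  # (4, 256, 96)
```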
The whole NeRF network (Mildenhall et al. 2020) bundled as one primitive: 8 hidden ReLU-FC layers of hiddenDim, mid-skip concatenating \(\gamma (x)\) at layer 5, dual output heads (1-dim volume density \(\sigma \) + 3-dim RGB via a direction-conditioned branch). Under 600K parameters at the canonical config. Signature: nerfMLP (encodedPosDim encodedDirDim hiddenDim : Nat).
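The trunk's layer widths follow mechanically from the skip. A sketch (helper and 1-indexing are ours) of the eight hidden layers' (in, out) dims:

```python
def nerf_trunk_dims(pos_dim: int, hidden: int):
    """(in, out) dims of the 8 hidden FC layers: layer 1 reads gamma(x),
    layer 5 reads the skip concat of gamma(x) with the hidden state,
    the rest are hidden -> hidden."""
    dims = []
    for layer in range(1, 9):
        if layer == 1:
            d_in = pos_dim
        elif layer == 5:
            d_in = hidden + pos_dim  # mid-skip: gamma(x) re-concatenated
        else:
            d_in = hidden
        dims.append((d_in, hidden))
    return dims
```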
Sinusoidal frequency basis (Vaswani et al. 2017, reused by NeRF in 2020): \(\gamma (p) = (\sin 2^0 \pi p, \cos 2^0 \pi p, \ldots , \sin 2^{L-1} \pi p, \cos 2^{L-1} \pi p)\). Zero trainable parameters — it’s a deterministic lift of a low-dim coordinate into a high-frequency feature space where an MLP has enough wiggle room to represent sharp details. Output dim \(=\) inputDim \(\cdot 2 \cdot \) numFrequencies. Signature: positionalEncoding (inputDim numFrequencies : Nat).
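The lift itself is a few lines. A minimal pure-Python sketch (the per-component sin/cos ordering varies across implementations; this one groups by component):

```python
import math

def positional_encoding(p, num_frequencies: int):
    """gamma(p) for a coordinate vector p: sin(2^k pi p_i), cos(2^k pi p_i)
    for each component p_i and k = 0 .. L-1. No trainable parameters;
    output length is len(p) * 2 * num_frequencies."""
    out = []
    for x in p:
        for k in range(num_frequencies):
            out.append(math.sin(2 ** k * math.pi * x))
            out.append(math.cos(2 ** k * math.pi * x))
    return out

# A 3-dim coordinate with L = 10 lifts to 60 dims.
print(len(positional_encoding([0.1, 0.2, 0.3], 10)))  # 60
```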
The ShuffleNet v1 unit stack (Zhang et al. 2017): grouped 1\(\times \)1 conv + channel-shuffle permutation + 3\(\times \)3 depthwise + grouped 1\(\times \)1 conv, for nUnits units (first downsampling, rest residual). The shuffle is parameter-free; grouping cuts the 1\(\times \)1 cost by a factor of the group count. Signature: shuffleBlock (ic oc groups nUnits : Nat).
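The shuffle is just a transpose of the channel index. A pure-Python sketch on channel labels (helper name is ours):

```python
def channel_shuffle(channels: list, groups: int) -> list:
    """Parameter-free channel shuffle: view the channel list as a
    (groups x n) grid, transpose, flatten. This mixes information
    across the groups that grouped 1x1 convs keep separate."""
    n = len(channels) // groups
    assert n * groups == len(channels)
    grid = [channels[g * n:(g + 1) * n] for g in range(groups)]
    return [grid[g][i] for i in range(n) for g in range(groups)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], 2))  # [0, 3, 1, 4, 2, 5]
```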
nUnits v2 units (Ma et al. 2018). Basic unit (stride 1): channel-split \([X_1, X_2]\), leave \(X_1\) alone, run \(X_2\) through \(1\times 1 \to 3\times 3\) DW \(\to 1\times 1\) at half-width, concat, then channel-shuffle. Downsample unit (stride 2): both branches see the full input; left does DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1, right does 1\(\times \)1 \(+\) DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1; concat doubles channels. v2 throws out v1’s grouped 1\(\times \)1 convs (G2) and skip-add (G4) per the paper’s practical guidelines. Signature: shuffleV2Block (ic oc nUnits : Nat).
Recurrent Invariant Point Attention + backbone frame update + side-chain \(\chi \)-angle prediction (the AlphaFold 2 structure module; Jumper et al. 2021). Weights shared across nBlocks rounds — param count does not multiply by nBlocks. Signature: structureModule (singleChannels pairChannels nBlocks : Nat).
nBlocks blocks of self-attention over nQueries learned object queries + cross-attention against an encoder output + FFN — the DETR-style decoder (Carion et al. 2020). The query embedding is part of the layer’s parameters. Signature: transformerDecoder (dim heads mlpDim nBlocks nQueries : Nat).
One stack of nLayers dilated causal residual blocks with doubling dilation rates \(2^0, 2^1, \ldots , 2^{\texttt{nLayers}-1}\) (van den Oord et al. 2016). Each block: dilated 2-tap causal conv \(\to \) gated activation \(\tanh (\text{filter}) \odot \sigma (\text{gate})\) \(\to \) 1\(\times \)1 project back to residualCh (residual path) + 1\(\times \)1 skip projection to skipCh. Skip outputs across blocks are summed into the final head. Signature: waveNetBlock (residualCh skipCh nLayers : Nat). Output channels are skipCh: bestiary convention picks the skip path as the "forward" output since it’s what feeds the final classifier.
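The doubling schedule gives exponential receptive-field growth for linear depth. A sketch (the `stacks` parameter, for repeating the schedule as the full WaveNet does, is our addition; the bestiary primitive is a single stack):

```python
def wavenet_receptive_field(n_layers: int, stacks: int = 1) -> int:
    """Receptive field (in samples) of `stacks` repeats of the doubling
    dilation schedule with 2-tap causal convs: each layer at dilation d
    adds d samples, so one stack reaches 1 + (2^n - 1) = 2^n."""
    per_stack = sum(2 ** i for i in range(n_layers))
    return 1 + stacks * per_stack

print(wavenet_receptive_field(10))     # one stack of 10 layers: 1024
print(wavenet_receptive_field(10, 3))  # three repeated stacks: 3070
```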