11 Bestiary of Architectures
The proof chapters above cover every layer shipped in the Phase 1–3 trainers (MLP, CNN, ResNet, MobileNet, EfficientNet, ViT). The bestiary is the book’s Part 2: a catalogue of famous architectures expressed as pure NetSpec values — no training runs, no VJP commitments, just “here’s what this architecture looks like in \(\sim \)15 lines of Lean.”
Each bestiary entry introduces zero, one, or occasionally two new Layer constructors for its architectural idiom. The primitive is bundled at the right abstraction level (a single Transformer encoder block, a whole Mamba block, a stage of Swin attention, etc.) — formalizing every sub-op of a modern architecture would be an asymptote we don’t approach.
11.1 Bestiary-only Layer primitives
Twenty-two new Layer constructors were added by the bestiary chapters. None are codegen-backed (MlirCodegen emits // UNSUPPORTED); the goal is pedagogical shape + parameter accounting, not training runs.
Selective state-space block in the S6 formulation (Gu & Dao 2023). Signature: mambaBlock (dim stateSize expand nBlocks : Nat). Bundles RMSNorm + linear expand + depthwise 1D conv + SiLU + selective-scan SSM + gated product + output projection, for nBlocks stacked layers. The selective scan is the novel primitive; everything else could be decomposed to existing layers if we cared to unpack the bundle.
Windowed multi-head self-attention at fixed spatial resolution (Liu et al. 2021). Signature: swinStage (dim heads mlpDim windowSize nBlocks : Nat). Internal blocks alternate W-MSA and SW-MSA (shifted-window) to let information cross window boundaries.
Swin’s 2\(\times \)2 spatial downsample + linear channel projection (inDim \(\to \) outDim). Transformer-side analog of a stride-2 conv. Signature: patchMerging (inDim outDim : Nat).
\(2 \times \) (conv3\(\times \)3 + BN + ReLU) then maxPool-2. Saves its pre-pool activation as a skip for the matching unetUp. Signature: unetDown (ic oc : Nat).
Transposed-conv \(2\times \) upsample + concat with matching skip + \(2 \times \) (conv3\(\times \)3 + BN + ReLU). Signature: unetUp (ic oc : Nat), where oc is both the output channel count and the expected skip width.
nBlocks blocks of self-attention over nQueries learned object queries + cross-attention against an encoder output + FFN. The query embedding is part of the layer’s parameters. Signature: transformerDecoder (dim heads mlpDim nBlocks nQueries : Nat).
Per-query class head (linear dim \(\to \) nClasses+1 with the “no object” slot) + box head (3-layer MLP to 4 scalars (cx, cy, w, h)). Signature: detrHeads (dim nClasses : Nat).
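As a back-of-envelope check, the head’s weight count falls out of the two shapes above. A throwaway Lean sketch (not part of the Layer API; biases are dropped, and the box MLP’s hidden width is assumed to equal dim, which the signature does not pin down):

```lean
-- Weight count for detrHeads (a sketch: biases dropped, box-MLP hidden
-- width assumed equal to dim).
def detrHeadParams (dim nClasses : Nat) : Nat :=
  dim * (nClasses + 1)                 -- class head, incl. the “no object” slot
  + dim * dim + dim * dim + dim * 4    -- 3-layer box MLP down to (cx, cy, w, h)

#eval detrHeadParams 256 91  -- ≈ 0.16M at DETR-R50’s dim = 256, COCO’s 91 classes
```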
Grouped 1\(\times \)1 conv + channel-shuffle permutation + 3\(\times \)3 depthwise + grouped 1\(\times \)1 conv, for nUnits units (first downsampling, rest residual). The shuffle is parameter-free; grouping reduces 1\(\times \)1 cost by \(g\). Signature: shuffleBlock (ic oc groups nUnits : Nat).
nUnits v2 units (Ma et al. 2018). Basic unit (stride 1): channel-split \([X_1, X_2]\), leave \(X_1\) alone, run \(X_2\) through \(1\times 1 \to 3\times 3\) DW \(\to 1\times 1\) at half-width, concat, then channel-shuffle. Downsample unit (stride 2): both branches see the full input; left does DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1, right does 1\(\times \)1 \(+\) DW-3\(\times \)3-stride-2 \(+\) 1\(\times \)1; concat doubles channels. v2 throws out v1’s grouped 1\(\times \)1 convs (G2) and skip-add (G4) per the paper’s practical guidelines. Signature: shuffleV2Block (ic oc nUnits : Nat).
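The shuffle itself is just a fixed permutation — view the c channels as a \(g \times (c/g)\) grid, transpose, flatten. A minimal Lean sketch (the function name is hypothetical, not a Layer constructor):

```lean
-- Input channel that lands at output position j after a g-group shuffle of
-- c channels (view as g × (c/g), transpose, flatten). Parameter-free.
def shuffleSource (c g j : Nat) : Nat :=
  (j % g) * (c / g) + j / g

#eval (List.range 6).map (shuffleSource 6 2)  -- [0, 3, 1, 4, 2, 5]
```

With c = 6, g = 2 the two groups {0, 1, 2} and {3, 4, 5} interleave — exactly the cross-group mixing that lets grouped convs in adjacent units see each other’s channels.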
Dual-representation (MSA + pair) joint update: MSA row-attention with pair bias + MSA column-attention + MSA transition + outer-product-mean (\(\to \) pair) + triangle multiplicative (outgoing, incoming) + triangle self-attention (starting node, ending node) + pair transition. Signature: evoformerBlock (msaChannels pairChannels nBlocks : Nat). The triangulation-aware operations are the key inductive bias.
Recurrent Invariant Point Attention + backbone frame update + side-chain \(\chi \)-angle prediction. Weights shared across nBlocks rounds — param count does not multiply by nBlocks. Signature: structureModule (singleChannels pairChannels nBlocks : Nat).
Hybrid local-conv + patch-level transformer (Mehta & Rastegari 2022). Local 3\(\times \)3 conv \(\to \) 1\(\times \)1 projection to transformer dim \(\to \) unfold into patches \(\to \) \(L\) transformer blocks across patches \(\to \) fold back \(\to \) 1\(\times \)1 projection back \(\to \) concat with input \(\to \) 3\(\times \)3 fusion. Signature: mobileVitBlock (ic dim heads mlpDim nTxBlocks : Nat). The unfold/fold operations are pdiv_reindex-style shape transformations; all the genuinely new math is already covered by the transformer proof chapter.
ConvNeXt residual block \(\times \) nBlocks: 7\(\times \)7 depthwise conv \(\to \) LayerNorm \(\to \) 1\(\times \)1 expand (to \(4c\)) \(\to \) GELU \(\to \) 1\(\times \)1 project (back to \(c\)) + residual. The transformer-era CNN block (Liu et al. 2022). Does not include downsampling; see convNextDownsample. Signature: convNextStage (channels nBlocks : Nat).
Inter-stage spatial halving \(+\) channel-doubling: LayerNorm \(+\) 2\(\times \)2 conv stride 2. Swin’s patchMerging for CNNs. Signature: convNextDownsample (ic oc : Nat).
One stack of nLayers dilated causal residual blocks with doubling dilation rates \(2^0, 2^1, \ldots , 2^{\texttt{nLayers}-1}\) (van den Oord et al. 2016). Each block: dilated 2-tap causal conv \(\to \) gated activation \(\tanh (\text{filter}) \odot \sigma (\text{gate})\) \(\to \) 1\(\times \)1 project back to residualCh (residual path) + 1\(\times \)1 skip projection to skipCh. Skip outputs across blocks are summed into the final head. Signature: waveNetBlock (residualCh skipCh nLayers : Nat). Output channels are skipCh: bestiary convention picks the skip path as the “forward” output since it’s what feeds the final classifier.
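The doubling dilations are what buy the exponential receptive field: with 2-tap convs, one stack of nLayers blocks sees \(1 + \sum_{i=0}^{\texttt{nLayers}-1} 2^i = 2^{\texttt{nLayers}}\) samples. A throwaway Lean check (not part of the Layer API):

```lean
-- Receptive field of one stack of 2-tap causal convs at dilations 2^0 … 2^(n-1).
def waveNetReceptiveField (nLayers : Nat) : Nat :=
  1 + (List.range nLayers).foldl (fun acc i => acc + 2 ^ i) 0

#eval waveNetReceptiveField 10  -- 1024 samples seen by a 10-layer stack
```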
Sinusoidal frequency basis (Vaswani 2017, reused by NeRF 2020): \(\gamma (p) = (\sin 2^0 \pi p, \cos 2^0 \pi p, \ldots , \sin 2^{L-1} \pi p, \cos 2^{L-1} \pi p)\). Zero trainable parameters — it’s a deterministic lift of a low-dim coordinate into a high-frequency feature space where an MLP has enough wiggle room to represent sharp details. Output dim \(=\) inputDim \(\cdot 2 \cdot \) numFrequencies. Signature: positionalEncoding (inputDim numFrequencies : Nat).
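Since the layer is parameter-free, the whole thing fits in a few lines. A sketch (function names hypothetical, scalar case only):

```lean
-- γ(p) for one scalar coordinate: L (sin, cos) pairs at doubling frequencies.
def gamma (L : Nat) (p : Float) : List Float :=
  (List.range L).foldr (fun i acc =>
    let s := (2 ^ i).toFloat * 3.141592653589793 * p
    Float.sin s :: Float.cos s :: acc) []

-- Output width: inputDim · 2 · numFrequencies.
def peOutDim (inputDim numFrequencies : Nat) : Nat :=
  inputDim * 2 * numFrequencies

#eval peOutDim 3 10  -- 60: NeRF’s canonical γ(x) width at L = 10
```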
The whole NeRF network (Mildenhall et al. 2020) bundled as one primitive: 8 hidden ReLU-FC layers of hiddenDim, mid-skip concatenating \(\gamma (x)\) at layer 5, dual output heads (1-dim volume density \(\sigma \) + 3-dim RGB via a direction-conditioned branch). Under 600K parameters at the canonical config. Signature: nerfMLP (encodedPosDim encodedDirDim hiddenDim : Nat).
YOLOv3’s Darknet-53 residual stack (Redmon & Farhadi 2018). nBlocks residual blocks at fixed channels, each being 1\(\times \)1 conv (\(c \to c/2\)) \(+\) 3\(\times \)3 conv (\(c/2 \to c\)) \(+\) residual add. Lighter than a standard ResNet bottleneck; heavier than a ResNet-18 basic block. Signature: darknetBlock (channels nBlocks : Nat).
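A per-block weight count falls straight out of the two conv shapes (a sketch; BN and biases dropped):

```lean
-- darknetBlock weights per residual block at width c:
-- 1×1 conv (c → c/2) plus 3×3 conv (c/2 → c); the residual add is free.
def darknetBlockParams (c : Nat) : Nat :=
  (c / 2) * c + 9 * (c / 2) * c

#eval darknetBlockParams 256  -- 327680 ≈ 0.33M per block at c = 256
```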
Cross-Stage-Partial block (Wang et al. 2019), used by YOLOv4 onward. Splits input into two halves, processes one half through a stack of residual blocks, then concatenates with the untouched half and 1\(\times \)1-projects to oc. The specific inner block varies across YOLO versions (C3 in v5, C2f in v8, C3k2 in v11); this primitive approximates all three at the same abstraction level. Signature: cspBlock (ic oc nBlocks : Nat).
Lin et al. 2017. Takes the four stage outputs of a CNN backbone (channels c2/c3/c4/c5 at strides \(4/8/16/32\)), projects each to target channels with a \(1\times 1\) lateral conv, merges them top-down (upsample-2\(\times \) then elementwise add), and applies a \(3\times 3\) smoothing conv at each merged level. Output: four feature maps, each target-wide, at the original spatial resolutions. Bundled because the cross-scale add doesn’t fit a linear NetSpec. Standard kit in 2-stage detectors (Mask R-CNN, Cascade R-CNN) and single-stage detectors (RetinaNet). Signature: fpnModule (c2 c3 c4 c5 target : Nat).
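Parameter accounting stays simple despite the graph-shaped dataflow — only the lateral and smoothing convs carry weights. A sketch (biases dropped; the ResNet-50 stage widths in the #eval are the usual ones, not values this entry fixes):

```lean
-- fpnModule weights: four 1×1 laterals into `target` channels, then a 3×3
-- smoothing conv at each of the four merged levels.
def fpnParams (c2 c3 c4 c5 target : Nat) : Nat :=
  (c2 + c3 + c4 + c5) * target + 4 * 9 * target * target

#eval fpnParams 256 512 1024 2048 256  -- 3342336 ≈ 3.3M
```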
DeepLab v3+’s marquee module (Chen et al. 2018). Five parallel branches emitting oc channels each: (1) 1\(\times \)1 conv, (2–4) 3\(\times \)3 atrous convs at dilation rates 6 / 12 / 18, (5) global avg-pool + 1\(\times \)1 conv + bilinear upsample. Concatenate, then a 1\(\times \)1 fusion conv back to oc. All branches include BN + ReLU. Atrous rates widen the effective receptive field without changing param count; the pool branch supplies image-level context. Signature: asppModule (ic oc : Nat).
The GoogLeNet parallel-branch module (Szegedy et al. 2014). Four branches computed in parallel: 1\(\times \)1 conv (b1 channels); 1\(\times \)1 reduce then 3\(\times \)3 (b2); 1\(\times \)1 reduce then 5\(\times \)5 (b3); 3\(\times \)3 maxPool then 1\(\times \)1 (b4). Concat along channels for b1 + b2 + b3 + b4 outputs. The 1\(\times \)1 dimension reducers on branches 2 and 3 are the paper’s trick — they make the expensive 3\(\times \)3 and 5\(\times \)5 convs operate on reduced channel counts. Signature: inceptionModule (ic b1 b2reduce b2 b3reduce b3 b4 : Nat).
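The same style of accounting works branch by branch. A sketch (biases and BN dropped; the #eval config is GoogLeNet’s inception(3a)):

```lean
-- inceptionModule weights, branch by branch.
def inceptionParams (ic b1 b2r b2 b3r b3 b4 : Nat) : Nat :=
  ic * b1                        -- branch 1: 1×1
  + ic * b2r + 9 * b2r * b2      -- branch 2: 1×1 reduce, then 3×3
  + ic * b3r + 25 * b3r * b3     -- branch 3: 1×1 reduce, then 5×5
  + ic * b4                      -- branch 4: 3×3 maxPool, then 1×1

-- Output width is the channel concat of the four branches.
def inceptionOutChannels (b1 b2 b3 b4 : Nat) : Nat :=
  b1 + b2 + b3 + b4

#eval inceptionOutChannels 64 128 32 32  -- 256 channels out of inception(3a)
```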
11.2 Bestiary entries
The entries are grouped by task domain. The first block (vision classifiers) is where Part 1’s VJP’d primitives show up at real-world scale — every layer is one you’ve already seen proved correct. Subsequent blocks step out to detection, segmentation, reinforcement learning, and the non-vision outliers (language, audio, 3D, multimodal, science).
11.2.1 Vision classifiers — Part 1’s primitives at scale
Image-classification backbones built out of conv / pool / batch-norm / residual / attention / patch-embed — the exact layer kit VJP’d in Part 1. If a chapter of Part 1 proved it, a bestiary entry below puts it to work.
LeNet (Bestiary/LeNet.lean)
The 1998 original. Zero new primitives — just conv2d \(+\) maxPool \(+\) dense. Variants: LeNet-5 (61K params, the canonical CNN) and LeNet-300-100 (266K params, pure-MLP baseline). Historical importance; still the ground-truth pattern every later CNN riffs on.
AlexNet (Bestiary/AlexNet.lean)
Zero new primitives — five convs, three FCs, pools. The 2012 ImageNet winner that restarted modern deep learning. Variants: AlexNet (62M vs the paper’s 60M) + tiny CIFAR fixture. \(\sim \)58M of the 62M live in the three FC layers, which is precisely why every post-2015 CNN dropped FC stacks for globalAvgPool + one final dense. LRN is omitted (replaced in the field by BatchNorm in 2015); dropout is training-time only.
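The FC-heavy split is easy to verify by hand. A back-of-envelope #eval (the 9216 = 6\(\cdot \)6\(\cdot \)256 flatten width and the 4096 / 4096 / 1000 FC widths are the paper’s, not values this entry pins down):

```lean
-- Weights in AlexNet’s three FC layers (biases dropped).
#eval 9216 * 4096 + 4096 * 4096 + 4096 * 1000  -- 58621952 ≈ 58M of the 62M
```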
SqueezeNet (Bestiary/SqueezeNet.lean)
Zero new primitives — the .fireModule constructor was already in Types.lean. AlexNet-level accuracy in 1.25M params via the fire module: squeeze 1\(\times \)1 conv followed by parallel expand 1\(\times \)1 + 3\(\times \)3 convs concatenated. The early efficiency-CNN family alongside MobileNet and ShuffleNet. Variants: SqueezeNet 1.0 (1.25M, paper-exact), 1.1 (1.24M — earlier downsample), tiny fixture.
Inception v1 / v3 / v4 (Bestiary/Inception.lean)
Uses § 97. GoogLeNet’s parallel multi-scale conv-branches concatenated along channels; the 1\(\times \)1 dimension reducer was invented here. Variants: GoogLeNet (7M, paper-exact), Inception-v3 (23.8M, paper 23M), Inception-v4 (33M, paper 42M — a bit low because our unified module approximates v3/v4’s richer module catalog). Auxiliary classifiers (v1) and asymmetric factorizations (v3/v4) are bestiary omissions.
Xception (Bestiary/Xception.lean)
Zero new primitives — reuses existing .separableConv. “Extreme Inception”: every conv is a depthwise-separable conv. The design choice that made MobileNet possible a year later. Variants: Xception (21.9M, paper-exact at 22M), tiny fixture. Residual skips around each block-of-three-sep-convs are implicit in the linear NetSpec.
ShuffleNet (Bestiary/ShuffleNet.lean)
Uses § 83. Variants: ShuffleNet 0.5\(\times \) / 1.0\(\times \) / 2.0\(\times \) at \(g=3\), plus a tiny fixture.
ShuffleNet v2 (Bestiary/ShuffleNetV2.lean)
Uses § 84. The efficient-CNN paper that called out FLOPs as a bad latency proxy: measured memory-access cost (MAC) directly and derived four practical guidelines (equal channel widths, avoid grouped convs, avoid fragmentation, avoid element-wise ops). v2’s architecture throws out everything in v1 that violated those rules — no grouped 1\(\times \)1 convs, no skip-add — in favor of channel-split + identity + concat + channel-shuffle. Variants: 0.5\(\times \) (1.37M, paper 1.4M), 1.0\(\times \) (2.29M, paper 2.3M), 1.5\(\times \) (3.52M, paper 3.5M), 2.0\(\times \) (7.41M, paper-exact), tiny fixture. All widths within 2% of paper.
MobileViT (Bestiary/MobileViT.lean)
Uses § 87 plus existing invertedResidual for the MV2 stages. Hybrid mobile backbone: MobileNet V2 body with MobileViT blocks replacing some of the deeper inverted-residual stages. Variants: MobileViT-S (5.6M, paper-exact), XS (2.3M), XXS (1.3M), tiny fixture.
ConvNeXt (Bestiary/ConvNeXt.lean)
Uses § 88 and § 89. ResNet-50 modernized with transformer design choices: 7\(\times \)7 depthwise + inverted bottleneck + LN + GELU + separate inter-stage downsamples. Proof-by-construction that CNNs didn’t need to die in 2021. Variants: T (28M), S (50M), B (89M), L (198M), tiny fixture — all matching the paper within 1%.
Swin Transformer (Bestiary/SwinT.lean)
Uses § 77 and § 78. Variants: Swin-T / -S / -B / tiny. Swin-T lands at 28M params matching the paper exactly.
11.2.2 Object detection
Localize and classify. Detection heads are where the linear NetSpec shape starts to creak: multi-scale FPN outputs need a graph, not a list. The bestiary entries show the single-scale view and defer the multi-head refactor to the limitations discussion.
YOLO v1/v3/v5/v8/v11 (Bestiary/YOLO.lean)
v1 uses zero new primitives (conv2d + maxPool + flatten + dense; the YOLO-ness lives in the loss and output reshape, not the architecture). v3 adds § 93 for the Darknet-53 body. v5/v8/v11 use § 94 for their CSPDarknet backbones. Variants: YOLOv1 (271M, paper-exact), fast, tiny, YOLOv3 (40M backbone), YOLOv5s/m (3M/8M single-scale), YOLOv8n/s (0.7M/2.9M single-scale), YOLOv11n/m (backbone only). Multi-scale FPN detection heads don’t linearize — same skip-connection problem as UNet; all entries show a single-scale view.
Mask R-CNN (Bestiary/MaskRCNN.lean)
Uses § 95. The canonical two-stage detector + instance segmentation reference (He et al. 2017). Five architectural pieces: ResNet-FPN backbone, RPN for anchor-box proposals, ROI-Align for per-proposal feature extraction (an orchestration step, not a layer), a 2-layer FC box head for classification + bbox regression, and a 4-conv + transposed-conv mask head for per-class \(28 \times 28\) masks. Shown as separate NetSpecs per head (SAM-style decomposition). Param totals: backbone+FPN 45.8M, box head 14.3M, mask head 2.6M, RPN 0.6M — \(\sim \)63M in total, matching paper. The FPN cross-scale add is the one thing that demanded a new bundled primitive; everything else reuses existing ResNet / conv / dense primitives. DETR is its end-to-end transformer-era cousin; Mask2Former is the DETR-style instance-segmentation successor.
DETR (Bestiary/DETR.lean)
Uses § 81 and § 82, plus existing bottleneckBlock (ResNet backbone), patchEmbed (absorbs the DETR 1\(\times \)1 channel reduce), and transformerEncoder. Variants: DETR-R50 (41M, paper-exact), DETR-R101 (60M), tiny.
11.2.3 Semantic segmentation
Pixel-level labeling. Symmetric encoder/decoder with skip connections is the recurring pattern; UNet is the canonical instance, and the same shape later became the diffusion-model backbone.
UNet (Bestiary/UNet.lean)
Uses § 79 and § 80. Skip connections are implicit — the \(i\)-th unetUp from the bottom pairs with the \(i\)-th unetDown from the top. Variants: original (1-channel, 2-class), RGB, small, tiny. Original lands at 31M params matching Ronneberger.
DeepLab v3+ (Bestiary/DeepLabV3Plus.lean)
Uses § 96. The pre-transformer segmentation workhorse (Chen et al. 2018) — still deployed widely in remote sensing, medical imaging, and autonomous-driving perception pipelines. Two ideas: atrous (dilated) convolutions in the backbone’s last stage (param-count-free receptive-field expansion) + an ASPP module for dense multi-scale context at the deepest feature resolution. The “+” in v3+ adds a lightweight decoder that upsamples the ASPP output 4\(\times \) and concatenates a low-level skip from backbone stage 2. Variants: ResNet-101 backbone (59M, paper \(\sim \)63M), MobileNet v2 backbone (5.7M, paper \(\sim \)6M mobile variant), tinyDeepLab fixture. The skip-to-stage-2 in the decoder doesn’t linearize cleanly; the same hack UNet and WaveNet use. SegFormer (next entry) argues “do ASPP’s multi-scale context via a transformer pyramid”; different mechanism, same goal.
SegFormer (Bestiary/SegFormer.lean)
Zero new primitives. Semantic segmentation via a hierarchical transformer pyramid (MiT encoder: 4 stages of .transformerEncoder glued by .patchMerging) and a lightweight MLP decoder (a handful of .dense calls). The decoder stays trivially small across all B0–B5 encoder sizes — the design argument of the paper is precisely that a good pretrained transformer feature pyramid makes the segmentation head cheap. Variants: MiT-B0 encoder (2.6M, paper 3.7M), B2 (19M, paper 25M), B5 (61M, paper 82M), shared MLP decoder (1M single-scale approx), tiny fixture. Uniform \(\sim \)25% undercount across all encoder sizes because real MiT uses depthwise convs inside the FFN and overlapping patch embeddings at inter-stage transitions; our .transformerEncoder and .patchMerging approximate the shape without those details. Comparison to DeepLab v3+: SegFormer’s decoder is \(\sim \)3M params across all sizes, DeepLab’s ASPP module is \(\sim \)15M and needs per-receptive-field tuning.
SAM (Bestiary/SAM.lean)
Zero new primitives. Promptable segmentation (Kirillov et al. 2023): a ViT image encoder runs once per image, a tiny prompt encoder tokenizes clicks / boxes / masks, a lightweight transformer mask decoder cross-attends between image tokens, prompt tokens, and a handful of learned output queries. Image encoder is .patchEmbed + .transformerEncoder (the ViT kit); mask decoder is a small .transformerDecoder (from DETR, 4 queries, 2 blocks). Variants: SAM ViT-B encoder (88M, paper total 91M), ViT-L (307M, paper 308M), ViT-H (635M, paper 636M), shared 3.3M mask decoder, tinySAM fixture. The image encoder accounts for \(\sim \)99% of each variant’s parameter budget, which is why EfficientSAM (Xiong et al. 2023) focused its distillation there.
11.2.4 Image generation
Networks whose output is a novel image. Backbones overlap heavily with segmentation (UNet again), but the training objective is generative: denoise a random input until it becomes a plausible sample.
VAE (Bestiary/VAE.lean)
Zero new primitives. The classical variational autoencoder (Kingma & Welling 2013): encoder outputs \((\mu , \log \sigma ^2)\) (represented as a single tensor with doubled final width), the reparameterization trick \(z = \mu + \sigma \odot \epsilon \) samples \(z\) in training code, the decoder reconstructs. KL divergence between the learned latent distribution and a standard normal is the regularizer. Variants: MNIST MLP VAE (20-dim latent, textbook example), CIFAR conv VAE (\(4 \times 4 \times 4\) spatial latent, SD-style), tiny fixture. All shown as encoder + decoder pairs. Training-code details (the actual sampling step, the KL loss) live outside the NetSpec. The same architectural template scales up to Stable Diffusion’s VAE (see StableDiffusion.lean); 2013’s 20-dim MNIST latent and 2022’s \(64 \times 64 \times 4\) SD latent are the same idea at different scales. VQ-VAE (discrete-codebook variant) and \(\beta \)-VAE (scaled KL) are mentioned in the prose notes.
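The reparameterization step itself is a one-liner once \(\epsilon \) is drawn. A Lean sketch of the scalar case (names hypothetical — in the trainer \(\epsilon \) comes from a standard-normal sampler, outside the NetSpec):

```lean
-- z = μ + σ ⊙ ε, with σ recovered from the stored log-variance.
def reparameterize (mu logVar eps : Float) : Float :=
  mu + Float.exp (0.5 * logVar) * eps

-- With ε = 0 the “sample” collapses to the mean μ.
#eval reparameterize 1.5 0.0 0.0
```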
DCGAN (Bestiary/DCGAN.lean)
Zero new primitives. Deep Convolutional GAN (Radford et al. 2015) — the paper that made GAN training reliably work. Five design guidelines (strided convs instead of pooling, BN everywhere except \(G\)’s output / \(D\)’s input, no hidden FC layers, ReLU in \(G\) / LeakyReLU in \(D\), Adam with specific hyperparams) that became the default for every GAN paper since. Three NetSpecs: noise projector (dense \(100 \to 4 \times 4 \times 1024\), 1.65M), generator convs (11M, paper \(\sim \)12M), discriminator (11M, paper-exact). Transposed convs in \(G\) are approximated by standard convs of matching kernel and channels (same params, spatial doubling is forward-pass-only); same hack appears in VAE and Stable Diffusion entries. GAN training dynamics (mode collapse, equilibrium stability) are all training-procedure concerns living outside the NetSpec.
Pix2Pix (Bestiary/Pix2Pix.lean)
Zero new primitives. The paired-data ancestor of CycleGAN, from the same lab 9 months earlier (Isola et al. 2017). UNet generator (8 levels, \(\sim \)70M approx vs paper \(\sim \)54M — our .unetDown / .unetUp use 2 convs per level where Pix2Pix uses 1 strided conv, so we overcount) + PatchGAN discriminator (identical to CycleGAN’s, 2.8M). Trained with GAN loss + L1 reconstruction — the L1 term is a direct supervision signal that exists only because pairs exist. When you don’t have pairs you fall back to CycleGAN’s cycle-consistency trick. Hardware context: 2016–2017, 54M UNet on \(256 \times 256\) images meant batch size 1 on a GTX 1080 Ti, which is how InstanceNorm became the default normalizer in this lineage — it was what fit.
CycleGAN (Bestiary/CycleGAN.lean)
Zero new primitives. Unpaired image translation (Zhu et al. 2017): two generators \(G : X \to Y\) and \(F : Y \to X\) plus two PatchGAN discriminators, trained with adversarial loss + cycle consistency loss \(\| F(G(x)) - x\| _1\). The cycle constraint is what makes unpaired training work — without it, \(G\) could mode-collapse every \(x\) to one target. Generator is a Johnson-style ResNet: convs down + 9 .residualBlocks at 256 channels + convs up (11.4M, paper \(\sim \)11M). Discriminator is a PatchGAN: 5 strided convs with a \(70 \times 70\) receptive field, outputs an \(N \times N\) grid of patch real/fake logits (2.8M, paper \(\sim \)2.8M). Four-network pattern shown as two specs (\(G\) and \(D\)); the other \(F\) and \(D_X\) are architecturally identical copies. The “one clever loss” does the work — the architecture is quite ordinary.
DDPM (Bestiary/Diffusion.lean)
Zero new primitives. The denoiser is a UNet — literally the same § 79 and § 80 Ronneberger shipped in 2015. Everything “diffusion” lives in the training loop: a forward noise schedule adds Gaussian noise over \(T\) steps, a reverse process learns to predict the added noise, sampling iterates the reverse from pure noise back to a clean image. None of that is a layer. The timestep conditioning MLP (sinusoidal \(\to \) dense \(\to \) dense) is shown as a standalone NetSpec that reuses positionalEncoding from NeRF. Variants: CIFAR config (\(32 \times 32\) backbone), \(256 \times 256\) high-res config, tiny fixture, timestep embed. Our simplified backbone undercounts vs paper (paper DDPM adds residual blocks, GroupNorm, attention-at-low-res, and per-block time-embedding projection on top of the UNet); the spec’s value is showing the architectural shape, not an exact param match. The lesson is the same one CLIP and NeRF taught: the novelty lives in the training procedure, not in layer design.
Stable Diffusion (Bestiary/StableDiffusion.lean)
Zero new primitives. The paper that made generative image models consumer-reachable (Rombach et al. 2022). Two architectural moves over DDPM, each individually small: (a) latent diffusion — run the diffusion process on 64\(\times \)64\(\times \)4 VAE latents instead of 512\(\times \)512 pixels, cutting spatial work \(\sim \)64\(\times \); (b) text conditioning via cross-attention — at each interior UNet resolution, insert a spatial transformer block that cross-attends from image tokens to CLIP text embeddings. Shown as six separate NetSpecs: VAE encoder, VAE decoder, CLIP text encoder (123M, matches CLIP ViT-L/14 exactly), UNet backbone (202M backbone approx; real SD 1.5 UNet is 865M, missing \(\sim \)650M is the interleaved cross-attention), an explicit spatial-transformer-block spec (uses .transformerDecoder with nQueries = 0 — same primitive Whisper’s decoder uses, same mechanism as DETR’s decoder applied at image-feature resolution), and a tiny end-to-end fixture. Three components — the VAE encoder, VAE decoder, and CLIP text encoder — are pretrained and frozen during SD’s main training; only the UNet trains. SDXL scales the UNet to 2.6B; SD 3 swaps the UNet for a DiT transformer. The latent-diffusion plus text-conditioning template is the same.
11.2.5 Reinforcement learning
Two-headed (policy + value) networks wrapped in a self-play + MCTS outer loop. The architectural side is a stack of residual CNN blocks; the complexity lives in the outer loop, not the network.
AlphaGo (Bestiary/AlphaGo.lean)
Zero new primitives. The original 2016 Lee Sedol system. Three separate networks: a 13-layer conv policy network (\(\sim \)3.9M) trained first on 30M human games then fine-tuned via self-play, a 13-layer conv + FC-head value network (\(\sim \)4M), and a shallow linear rollout policy used inside MCTS for fast tree playouts. Input is 48 hand-crafted Go feature planes (liberties, ladder patterns, 3\(\times \)3 stone arrangements, etc.). AlphaGo Zero (next entry) throws all of this out — raw board only, one two-headed net, self-play only — and plays better (5185 Elo vs 3140). The one-sentence lesson is written into every follow-up paper: features the network can learn on its own, it will.
AlphaZero (Bestiary/AlphaZero.lean)
Uses existing primitives only (convBn + residualBlock + conv2d + dense). Two-headed (policy + value), expressed as two separate NetSpec values sharing the body in prose. Variants: AlphaGo Zero (Go), AlphaZero chess, tiny fixture.
MuZero (Bestiary/MuZero.lean)
Zero new primitives — three ResNet-style networks (representation, dynamics, prediction) reusing convBn + residualBlock + dense. The architectural novelty is the three-network factoring, not any single layer type. Five NetSpec values per variant (rep, dyn body, dyn reward head, pred policy, pred value). Variants: Go, Atari (representation only), tiny.
11.2.6 Beyond vision
Architectures where the task domain is not a 2D image: language, audio, 3D scene reconstruction, multimodal embedding, scientific modeling. Several of these (NeRF, CLIP) have essentially no architectural novelty — the interesting work lives in the data, loss, or training procedure, and the bestiary entry exists to make that point.
Mamba (Bestiary/Mamba.lean)
Uses § 76. Variants: Mamba-130M / 370M / 790M / tiny, matching Gu & Dao’s param counts within \(\sim \)5%.
BERT / RoBERTa (Bestiary/BERT.lean)
Zero new primitives — uses .transformerEncoder, same kit as ViT / DETR, plus .dense vocab\(\to \)dim standing in for the token-embedding table (faithful param count, shape semantics cheat since a linear NetSpec can’t express the \(L \to L \times D\) lookup). Variants: BERT-base (109M, paper 110M), BERT-large (335M, paper 340M), RoBERTa-base (124M, paper 125M), RoBERTa-large (355M, paper-exact), tinyBERT fixture. The architectural lesson is that RoBERTa = BERT; all RoBERTa gains came from training procedure (dynamic masking, more data, bigger batches, dropped NSP) — none of which lives in the NetSpec.
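The 109M figure can be roughly reproduced from the encoder shape alone. A sketch (the 30522-token WordPiece vocab is the paper’s; biases, LayerNorm, and position/segment embeddings are dropped, which is why the estimate lands slightly under):

```lean
-- Encoder-stack weight estimate: token table + per-block attention (4 d²)
-- and FFN (2 d·mlp) projections.
def encoderParams (vocab dim mlpDim nBlocks : Nat) : Nat :=
  vocab * dim + nBlocks * (4 * dim * dim + 2 * dim * mlpDim)

#eval encoderParams 30522 768 3072 12  -- 108375552 ≈ 108M for BERT-base
```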
GPT-1 / GPT-2 (Bestiary/GPT.lean)
Zero new primitives — the decoder-only counterpart of BERT. Same .transformerEncoder, now read as a stack of decoder blocks with a causal attention mask (a training-time detail, not a parameter). No pooler; GPT-2 uses pre-norm instead of BERT’s post-norm (zero-parameter swap). Weight tying: the LM head reuses the token-embedding matrix, so our .dense vocab\(\to \)D stand-in already pays for both input and output sides. Variants: GPT-1 (116M, paper 117M), GPT-2 small (123M, paper 124M), medium (353M), large (772M, paper 774M), XL (1.56B, paper 1.5B), tinyGPT fixture. Reference implementation: Karpathy’s nanoGPT (\(\sim \)300 lines of PyTorch) targets GPT-2 small and is the canonical mental model.
QANet (Bestiary/QANet.lean)
Zero new primitives. The 2018 reading-comprehension architecture (Yu et al.) that killed the BiLSTM for SQuAD-style tasks. Core contribution: an encoder block combining 4 depthwise-separable convs (local context) with a self-attention + FFN transformer block (global context) — the “conv + attention hybrid” shape MobileViT and ConvNeXt rediscovered in 2022. Expressed with primitives we already have: 4 .separableConv 128 128 1 calls + 1 .transformerEncoder 128 8 512 1 per block. Per-block count: \(\sim \)270K; the paper’s model encoder stack (7 blocks) lands at \(\sim \)1.9M, repeated 3 times in the full architecture. The BiDAF-style context-query attention and character/word embedding tables are omitted from the spec; described in prose. QANet’s headline number was 3–4\(\times \) training speedup over BiLSTM-based competitors, a clean example of hardware (parallelization of convs vs. sequential RNN roll-outs) forcing an architectural choice. BERT landing 7 months later ended the SQuAD-as-benchmark era, but QANet’s hybrid shape lived on — showed up repeatedly in vision 4 years later.
Nyströmformer (Bestiary/Nystromformer.lean)
Zero new primitives. An efficient-attention transformer (Xiong et al. 2021) that replaces the \(O(n^2)\) softmax attention with an \(O(n)\) Nyström-approximation computed via \(m\) landmark tokens, \(m \ll n\). Crucially, this is a compute change, not a parameter change: the \(W_Q / W_K / W_V / W_O\) projections and the FFN are identical to a standard transformer. Our .transformerEncoder spec is therefore identical to BERT’s at each scale — base lands at 108M (vs BERT-base 109M), large at 333M (vs BERT-large 335M). The entry’s pedagogical value is prose-level: most of the 2020–2022 efficient-attention literature (Linformer, Performer, Longformer, BigBird, Reformer, FlashAttention) is parameter-identical to vanilla attention, and the bestiary can’t differentiate them at the layer level. Nyströmformer is worth highlighting because the specific trick — a 1928 result from numerical linear algebra, routed through kernel-method literature in the early 2000s, finally surfacing in transformers — is a genuinely long arc for one math idea.
WaveNet (Bestiary/WaveNet.lean)
Uses § 90. Dilated causal convolutions for raw audio sample prediction: exponential receptive field, linear parameter growth. The foundation of neural TTS and PixelCNN. Variants: single stack (0.4M), 3-stack (paper setup, with a NetSpec-linearity approximation), music (single stack, 4.1M), tiny. One honest limitation: the residual-vs-skip dual-output pattern doesn’t linearize cleanly, so the 3-stack variant uses a simplified channel-flow approximation.
Whisper (Bestiary/Whisper.lean)
Zero new primitives. Encoder-decoder transformer over log-mel spectrograms (Radford et al. 2022). Encoder is .transformerEncoder on the 1500 post-stem audio tokens; decoder is .transformerDecoder (from DETR) with nQueries = 0 — a clean seq2seq decoder with self-attn + cross-attn + FFN, text tokens coming from a separate .dense vocab\(\to \)D stand-in tied to the LM head. Variants: tiny (7M enc, paper 39M total), base (19M enc, paper 74M), small (85M enc + 153M dec+emb = 238M vs paper 244M), medium (302M enc, paper 769M), large (629M enc, paper 1.55B total), plus a shared decoder spec and tinyWhisper fixture. Encoder and decoder are the same size per variant; adding them together recovers paper totals within 2–3%. The multitask interface (swap a prefix token to change language or switch between transcribe / translate) is prompt-engineering, not architecture — Whisper’s architectural novelty is essentially zero.
NeRF (Bestiary/NeRF.lean)
Uses § 91 and § 92. The “it’s literally just an MLP” paper. Under 600K params at canonical config; the architectural novelty is nonexistent. What makes NeRF work is the positional encoding + the volumetric-rendering loss — both outside the network, not layers in it. Variants: canonical (593K), fast (167K hidden=128), tiny fixture.
CLIP (Bestiary/CLIP.lean)
Zero new primitives. Two textbook encoders glued together by a contrastive loss: a ResNet-50 or ViT-B for images, a 12-layer causal transformer for text, each with a final linear projection to a shared embedding space. Variants: CLIP-RN50, CLIP-ViT-B/32 (151M, paper-exact), CLIP-ViT-L/14 (427M, paper-exact), tiny fixture. The architectural lesson from CLIP is identical to NeRF’s: the novelty lives in data and objective, not in layer design.
LLaVA / LLaVA-1.5 (Bestiary/LLaVA.lean)
Zero new primitives. The cleanest exhibit of the modern vision-language pattern: frozen CLIP ViT encoder + small MLP projector + (mostly) frozen LLaMA/Vicuna language model. Trained in two stages — projector-only pretrain, then joint instruction fine-tune. Shown as separate NetSpecs per component: vision encoder (303M, matching CLIP ViT-L/14), LLaVA-1 single-linear projector (4M), LLaVA-1.5 two-layer MLP projector (21M), LLM backbones at 7B and 13B. The LLM specs undercount real LLaMA by \(\sim \)23% because our .transformerEncoder uses a standard 2-projection FFN while LLaMA uses SwiGLU (3 projections); depth / width / heads still match. Key ratio: the projector is 0.3% of total LLaVA-1.5 7B parameters — the entire interesting work of the model lives in 21M of trainable bolt-on between two pretrained backbones. BLIP-2, Flamingo, and every modern VLM demo generalize this template with fancier adapters (Q-Former, Perceiver resampler, gated xattn-dense), but the frozen-backbone-plus-adapter shape is the same.
AlphaFold 2 Evoformer (Bestiary/Evoformer.lean)
Uses § 85 and § 86. Dual-representation (MSA + pair) doesn’t fit a linear NetSpec cleanly; the bestiary shows the two bundled primitives and notes the limitation. Variants: full (76M), mini, tiny.