10 Attention and the ViT finale
\(\partial \mathrm{softmax}(z)_j / \partial z_i = p_j (\delta _{ij} - p_i)\).
Closed-form collapse of the VJP: \(dz_i = p_i (dy_i - \langle p, dy \rangle )\).
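A quick numeric sanity check (an illustrative sketch in numpy, not the source's Lean proof): the closed-form collapse \(dz_i = p_i (dy_i - \langle p, dy \rangle)\) agrees with applying the full softmax Jacobian \(J_{ji} = p_j(\delta_{ij} - p_i)\) to the cotangent \(dy\).

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
dy = rng.normal(size=5)

# Softmax with the usual max-shift for stability.
p = np.exp(z - z.max())
p /= p.sum()

# Full Jacobian: J[j, i] = d softmax(z)_j / d z_i = p_j * (delta_ij - p_i).
J = np.diag(p) - np.outer(p, p)

# VJP via the explicit Jacobian...
dz_jacobian = J.T @ dy
# ...versus the closed-form collapse, no Jacobian materialized.
dz_closed = p * (dy - p @ dy)

assert np.allclose(dz_jacobian, dz_closed)
```

The collapse avoids materializing the \(n \times n\) Jacobian, which is what makes the VJP formulation cheap.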
Composed from four proved \(\mathsf{HasVJPMat}\) building blocks.
Phase 8: the one bundled axiom needed to lift single-head SDPA to full multi-head attention. The "vmap over head axis" primitive.
A single vjpMat_comp gluing the two sublayers.
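The gluing step is the reverse-mode chain rule. A hypothetical numpy sketch of what a vjpMat_comp-style combinator does (names and signatures here are illustrative, not the source's Lean definitions): given forward/VJP pairs for sublayers f and g, build the pair for their composite by threading the cotangent backwards.

```python
import numpy as np

def comp_vjp(f_pair, g_pair):
    """Compose two (forward, vjp) pairs into one for g ∘ f (chain rule)."""
    f_fwd, f_vjp = f_pair
    g_fwd, g_vjp = g_pair
    def fwd(x):
        return g_fwd(f_fwd(x))
    def vjp(x, dy):
        y = f_fwd(x)                  # recompute the intermediate
        return f_vjp(x, g_vjp(y, dy)) # pull dy back through g, then f
    return fwd, vjp

# Example sublayers: f(x) = A @ x, g(y) = y * y (elementwise).
A = np.arange(6.0).reshape(2, 3)
f_pair = (lambda x: A @ x, lambda x, dy: A.T @ dy)
g_pair = (lambda y: y * y, lambda y, dy: 2 * y * dy)

fwd, vjp = comp_vjp(f_pair, g_pair)
x = np.array([1.0, 2.0, 3.0])
dy = np.array([1.0, 1.0])

# Check the composed VJP against central finite differences of dy . fwd(x).
eps = 1e-6
numeric = np.array([
    (dy @ fwd(x + eps * e) - dy @ fwd(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(vjp(x, dy), numeric, atol=1e-4)
```

Each proved HasVJPMat instance plays the role of one such pair; composition then needs no new axioms.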
Covers ViT-Tiny (12 blocks), ViT-Base (12), ViT-Large (24).
The full ViT transformer backbone as one \(\mathsf{HasVJPMat}\), derived from exactly one new axiom (mhsa_has_vjp_mat).