Verified Deep Learning with Lean 4

10 Attention and the ViT finale

Axiom 60 Softmax Jacobian

\(\partial\, \mathrm{softmax}(z)_j / \partial z_i = p_j (\delta_{ij} - p_i)\), where \(p = \mathrm{softmax}(z)\).
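For reference, the identity follows in one line from the definition \(p_j = e^{z_j} / \sum_k e^{z_k}\) by the quotient rule:

```latex
\frac{\partial p_j}{\partial z_i}
  = \frac{\delta_{ij}\, e^{z_j} \sum_k e^{z_k} - e^{z_j}\, e^{z_i}}
         {\left( \sum_k e^{z_k} \right)^2}
  = \delta_{ij}\, p_j - p_j\, p_i
  = p_j \left( \delta_{ij} - p_i \right).
```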

Theorem 61 Standalone softmax VJP

Closed-form collapse: \(dz_i = p_i (dy_i - \langle p, dy \rangle )\).
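The collapse is just the Jacobian of Axiom 60 contracted against the incoming cotangent \(dy\):

```latex
dz_i = \sum_j \frac{\partial p_j}{\partial z_i}\, dy_j
     = \sum_j p_j \left( \delta_{ij} - p_i \right) dy_j
     = p_i\, dy_i - p_i \sum_j p_j\, dy_j
     = p_i \left( dy_i - \langle p, dy \rangle \right).
```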

Theorem 62 Row-wise softmax VJP on a matrix
Theorem 63 SDPA backward wrt Q

Composed from four proved \(\mathsf{HasVJPMat}\) building blocks.

Theorem 64 SDPA backward wrt K
Theorem 65 SDPA backward wrt V
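For reference, writing \(S = QK^\top / \sqrt{d_k}\), \(P = \mathrm{rowsoftmax}(S)\), \(O = PV\), the three pullbacks these theorems establish are the standard ones, with the row-wise softmax VJP of Theorem 62 as the middle step:

```latex
\begin{aligned}
dV &= P^\top\, dO \\
dP &= dO\, V^\top \\
dS_r &= P_r \odot \left( dP_r - \langle P_r, dP_r \rangle\, \mathbf{1} \right)
        \quad \text{(for each row } r\text{)} \\
dQ &= dS\, K / \sqrt{d_k} \\
dK &= dS^\top Q / \sqrt{d_k}
\end{aligned}
```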
Axiom 66 Multi-head SDPA VJP

Phase 8: the single bundled axiom needed to lift single-head SDPA to full multi-head attention, in effect a "vmap over the head axis" primitive.
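A plausible shape for such an axiom, sketched in Lean 4. Only the names `HasVJPMat` and `mhsa_has_vjp_mat` come from the development; `Mat`, `mapHeads`, and all signatures are illustrative assumptions:

```lean
-- Hypothetical sketch only: `Mat`, `mapHeads`, `mapHeads'`, and the
-- signatures below are illustrative assumptions, not the library's API.
-- `mapHeads h f` applies the single-head map `f` independently to each
-- of the `h` head slices: a "vmap over the head axis".
axiom mhsa_has_vjp_mat
    {n d h : Nat}
    (f  : Mat n d → Mat n d)            -- single-head SDPA
    (f' : Mat n d → Mat n d → Mat n d)  -- its proved VJP
    (hf : HasVJPMat f f') :
    HasVJPMat (mapHeads h f) (mapHeads' h f')
```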

Theorem 67 Per-token LayerNorm lifted to a matrix
Theorem 68 Per-token dense lifted to a matrix
Theorem 69 Per-token GELU lifted to a matrix
Theorem 70 Transformer MLP sublayer VJP
Theorem 71 Transformer attention sublayer with residual VJP
Theorem 72 Transformer MLP sublayer with residual VJP
Theorem 73 Transformer block VJP

A single vjpMat_comp gluing the two sublayers.
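The gluing step, sketched in Lean 4; apart from `HasVJPMat` and `vjpMat_comp`, all names and signatures are hypothetical:

```lean
-- Sketch: composing the two sublayer VJPs by the chain rule.
-- `attnSub`, `mlpSub`, their VJPs, and the exact signature of
-- `vjpMat_comp` are illustrative assumptions.
theorem block_has_vjp_mat
    (hA : HasVJPMat attnSub attnSub')
    (hM : HasVJPMat mlpSub  mlpSub') :
    HasVJPMat (fun x => mlpSub (attnSub x))
              (fun x dy => attnSub' x (mlpSub' (attnSub x) dy)) :=
  vjpMat_comp hA hM
```

Note the reverse-mode order in the composed pullback: the cotangent `dy` passes through the MLP sublayer's VJP first, then through the attention sublayer's.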

Theorem 74 Transformer tower, any depth via induction

Covers ViT-Tiny (12 blocks), ViT-Base (12), and ViT-Large (24).
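The induction, in sketch form; besides `HasVJPMat` and `vjpMat_comp`, every name and signature here is a hypothetical stand-in:

```lean
-- Sketch: a depth-`n` tower is `n` iterated blocks, and its VJP
-- statement follows by induction, composing one block VJP per step.
-- `Mat`, `block`, `towerVJP`, `vjpMat_id`, and `block_has_vjp_mat`
-- are illustrative assumptions.
def tower {t d : Nat} : Nat → (Mat t d → Mat t d)
  | 0     => id
  | n + 1 => fun x => block (tower n x)

theorem tower_has_vjp_mat : ∀ n, HasVJPMat (tower n) (towerVJP n)
  | 0     => vjpMat_id
  | n + 1 => vjpMat_comp (tower_has_vjp_mat n) block_has_vjp_mat
```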

Theorem 75 ViT body: the grand finale

The full ViT transformer backbone as one \(\mathsf{HasVJPMat}\), derived from exactly one new axiom (\(\mathsf{mhsa\_has\_vjp\_mat}\)).