1
Introduction
▶
What this book is about
The thesis
Who this book is for
Why now
Why image recognition
Why Lean
Let’s go
2
You Are Here
▶
2.1
Example: MNIST linear classifier
▶
The forward pass and loss, in math.
The same thing, in NetSpec.
Training program and run command.
The gradient, in math.
Results.
Where this goes.
2.2
What’s inside
.train
?
▶
The training loop, the way a math book would write it.
The same loop, in Lean.
2.3
How this book is organized
▶
Foundation: Mathlib’s
fderiv
Why VJPs, not Jacobians?
Theorem and definition budget per chapter
Roadmap: skip to your target architecture
For readers of the first book
3
MNIST: 1D MLP
▶
The pdiv we built last chapter, now applied
The Jacobian as multidimensional “how does it jiggle?”
The dense layer’s Jacobian is the weight matrix
ReLU: the piecewise case
Softmax cross-entropy: where vectors collapse to a scalar
From Jacobians to VJPs
The theorems
3.1
Example: MNIST MLP
▶
Dataset overview
Architecture
Code: the full training program
Results
What is actually in those “20 502 chars”
4
MNIST: 2D CNN
▶
The MLP wastes information; we can do better
Three properties baked into a convolution
The Jacobian is sparse and structured
The backward of a convolution is another convolution
Max pooling: the winner gets the gradient
The rest of the network is Chapter 3
The theorems
4.1
Example: MNIST 2D CNN
▶
Architecture
Code: the full training program
Results
Why this network is still 99% dense-head
5
CIFAR with BatchNorm
▶
What BN actually is
The centering term has an indirect path
The inverse-stddev term is the hard part
The famous three-term backward
The affine step is just dense-with-broadcasting
Putting it together
The theorems
5.1
Example: the BN lift on CIFAR
▶
The two specs, differing by one keyword per layer
Training: one learns, one doesn’t
Where the BN lift actually lives
6
ResNet-34
▶
Deeper networks used to stop training
The residual block: ask for the difference, not the function
Why the proof is a one-liner
ResNet-34 is sixteen of these blocks, stacked
The theorem
6.1
Example: ResNet-34 on Imagenette
▶
The architecture
Results
6.2
What’s in the production recipe?
6.3
Ablation: what each ingredient contributes
6.4
ImageNet recipe
A reasonable stopping point
▶
Or, for breadth.
7
MobileNetV2
▶
A standard conv does two jobs at once
Depthwise: one kernel per channel, no cross-channel sum
The depthwise-separable sandwich
Inverted residual: depthwise at the wide point
Stacking the blocks
The theorems
7.1
Example: MobileNet V2 on Imagenette
▶
The architecture
Results
8
EfficientNet
▶
A convolution treats every channel equally
Three steps, one tensor product
From the framework: it’s an elementwise product
This is attention, in disguise, before transformers
EfficientNet-B0: MobileNet-V2 with SE everywhere
The theorem
8.1
Example: EfficientNet-B0 on Imagenette
▶
The architecture
Results
8.2
Side quests
9
ConvNeXt
▶
ConvNets strike back
Two new primitives
The ConvNeXt block
ConvNeXt-T is four stages, ResNet-style
The theorems
9.1
Example: ConvNeXt-T on Imagenette
▶
The architecture
Results
9.2
Data Augmentation
10
Vision Transformer
▶
Where Part 1 has been heading
What attention is, in one screen
Why this chapter has thirty theorems
What’s actually proved
10.1
Matrix-level machinery
10.2
Attention proofs
10.3
Example: ViT-Tiny on Imagenette
▶
The architecture
Results
10.4
Data Augmentation
10.5
ImageNet recipe
▶
What Part 1 has established
11
Bestiary of Architectures
▶
What “\(N\) new primitives” means.
11.1
Bestiary-only
Layer
primitives
11.2
Bestiary entries
▶
11.2.1
Vision classifiers — Part 1’s primitives at scale
11.2.2
Object detection
11.2.3
Semantic segmentation
11.2.4
Image generation
11.2.5
Reinforcement learning
11.2.6
Natural language processing
11.2.7
Diffusion
11.2.8
Beyond vision
A
Getting started
▶
Track 1: No-GPU Docker demo
Track 2: Native install (CUDA or ROCm)
Track 3: Proofs only (no IREE needed)
Common troubleshooting
B
On Verification
▶
B.1
Trust kernel
B.2
Finite-difference gradient checks
B.3
The JAX parallel pipeline
B.4
Independent kernel re-check (comparator)
B.5
The four layers, summarized
Acknowledgments
Dependency graph
Verified Deep Learning with Lean 4
Brett Koonce
1
Introduction
What this book is about
The thesis
Who this book is for
Why now
Why image recognition
Why Lean
Let’s go
2
You Are Here
2.1
Example: MNIST linear classifier
The forward pass and loss, in math.
The same thing, in NetSpec.
Training program and run command.
The gradient, in math.
Results.
Where this goes.
2.2
What’s inside
.train
?
The training loop, the way a math book would write it.
The same loop, in Lean.
2.3
How this book is organized
Foundation: Mathlib’s
fderiv
Why VJPs, not Jacobians?
Theorem and definition budget per chapter
Roadmap: skip to your target architecture
For readers of the first book
3
MNIST: 1D MLP
The pdiv we built last chapter, now applied
The Jacobian as multidimensional “how does it jiggle?”
The dense layer’s Jacobian is the weight matrix
ReLU: the piecewise case
Softmax cross-entropy: where vectors collapse to a scalar
From Jacobians to VJPs
The theorems
3.1
Example: MNIST MLP
Dataset overview
Architecture
Code: the full training program
Results
What is actually in those “20 502 chars”
4
MNIST: 2D CNN
The MLP wastes information; we can do better
Three properties baked into a convolution
The Jacobian is sparse and structured
The backward of a convolution is another convolution
Max pooling: the winner gets the gradient
The rest of the network is Chapter 3
The theorems
4.1
Example: MNIST 2D CNN
Architecture
Code: the full training program
Results
Why this network is still 99% dense-head
5
CIFAR with BatchNorm
What BN actually is
The centering term has an indirect path
The inverse-stddev term is the hard part
The famous three-term backward
The affine step is just dense-with-broadcasting
Putting it together
The theorems
5.1
Example: the BN lift on CIFAR
The two specs, differing by one keyword per layer
Training: one learns, one doesn’t
Where the BN lift actually lives
6
ResNet-34
Deeper networks used to stop training
The residual block: ask for the difference, not the function
Why the proof is a one-liner
ResNet-34 is sixteen of these blocks, stacked
The theorem
6.1
Example: ResNet-34 on Imagenette
The architecture
Results
6.2
What’s in the production recipe?
6.3
Ablation: what each ingredient contributes
6.4
ImageNet recipe
A reasonable stopping point
Or, for breadth.
7
MobileNetV2
A standard conv does two jobs at once
Depthwise: one kernel per channel, no cross-channel sum
The depthwise-separable sandwich
Inverted residual: depthwise at the wide point
Stacking the blocks
The theorems
7.1
Example: MobileNet V2 on Imagenette
The architecture
Results
8
EfficientNet
A convolution treats every channel equally
Three steps, one tensor product
From the framework: it’s an elementwise product
This is attention, in disguise, before transformers
EfficientNet-B0: MobileNet-V2 with SE everywhere
The theorem
8.1
Example: EfficientNet-B0 on Imagenette
The architecture
Results
8.2
Side quests
9
ConvNeXt
ConvNets strike back
Two new primitives
The ConvNeXt block
ConvNeXt-T is four stages, ResNet-style
The theorems
9.1
Example: ConvNeXt-T on Imagenette
The architecture
Results
9.2
Data Augmentation
10
Vision Transformer
Where Part 1 has been heading
What attention is, in one screen
Why this chapter has thirty theorems
What’s actually proved
10.1
Matrix-level machinery
10.2
Attention proofs
10.3
Example: ViT-Tiny on Imagenette
The architecture
Results
10.4
Data Augmentation
10.5
ImageNet recipe
What Part 1 has established
11
Bestiary of Architectures
What “\(N\) new primitives” means.
11.1
Bestiary-only
Layer
primitives
11.2
Bestiary entries
11.2.1
Vision classifiers — Part 1’s primitives at scale
11.2.2
Object detection
11.2.3
Semantic segmentation
11.2.4
Image generation
11.2.5
Reinforcement learning
11.2.6
Natural language processing
11.2.7
Diffusion
11.2.8
Beyond vision
A
Getting started
Track 1: No-GPU Docker demo
Track 2: Native install (CUDA or ROCm)
Track 3: Proofs only (no IREE needed)
Common troubleshooting
B
On Verification
B.1
Trust kernel
B.2
Finite-difference gradient checks
B.3
The JAX parallel pipeline
B.4
Independent kernel re-check (comparator)
B.5
The four layers, summarized
Acknowledgments