A Getting started
If you just want to see it run (train a real neural network, end to end, with the Lean \(\to \) MLIR \(\to \) IREE pipeline from Part 1), this appendix is the smallest path from zero to a trained model. Three tracks are included:
No-GPU Docker demo. Train MNIST on CPU in \(\sim \)5 minutes. Zero setup beyond Docker. Good if you don’t have a GPU and just want to confirm the pipeline works.
Native install (CUDA or ROCm). Train real image models (ResNet, MobileNet, EfficientNet, ViT) on your GPU. Requires an IREE build step but gives you the full book’s training runs.
Lean-only (no IREE). Read, build, and verify the proofs without running any neural-network training. Covers everything in Part 1 at the proof level.
Track 1: No-GPU Docker demo
The repository ships a Dockerfile that builds a CPU-only MNIST demo as a multi-stage image. The build pulls Lean 4, builds IREE’s CPU runtime, compiles the MLP trainer, and bakes the MNIST dataset into the image. Once built, the final image is \(\sim \)300 MB and requires neither a GPU nor Python.
git clone https://github.com/brettkoonce/lean4-mlir.git
cd lean4-mlir
docker build -t lean4-mlir-demo .
docker run --rm lean4-mlir-demo
Output: the same three-layer MLP shown in Chapter 3 (784 \(\to \) 512 \(\to \) 512 \(\to \) 10) trains for 12 epochs on real MNIST and converges to \(\sim \)97.9% test accuracy. The first build takes \(\sim \)10 minutes (dominated by IREE’s CMake + Ninja build); subsequent docker run invocations reuse the cached image.
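To confirm the image-size claim and rerun the demo from the cache, the standard Docker commands suffice:
docker images lean4-mlir-demo    # final image should be roughly 300 MB
docker run --rm lean4-mlir-demo  # reruns training from the cached image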
This is the fastest way to verify the Lean-to-MLIR-to-IREE pipeline actually works on your hardware before committing to a larger setup.
Track 2: Native install (CUDA or ROCm)
For training the larger networks (ResNet-34, MobileNet v2/v3/v4, EfficientNet, ViT) you need a GPU-accelerated IREE runtime. Steps:
Install Lean 4. Lean’s toolchain versions are managed by elan (the Lean analogue of Rust’s rustup).
curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh \
  -sSf | sh
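After the installer finishes, open a new shell or source elan’s environment file, then confirm the toolchain is visible. The env path below is an assumption based on elan’s rustup-style layout:
. "$HOME/.elan/env"   # only needed if you don't restart the shell
elan --version
lean --version        # inside the repo, elan fetches the pinned toolchain on first use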
Build the IREE runtime for your GPU backend. See IREE_BUILD.md in the repo for the full CMake invocation. Once built, the FFI shim in ffi/ links against libiree_runtime_unified.a from your build tree.
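For orientation, a CUDA-enabled runtime build looks roughly like the sketch below. The flags shown are IREE’s documented CMake options, but the source path and exact option set here are assumptions; IREE_BUILD.md is authoritative.
# Run from an IREE source checkout; illustrative only.
cmake -G Ninja -B ../iree-build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DIREE_HAL_DRIVER_CUDA=ON   # ROCm builds enable the HIP/ROCm HAL driver instead
cmake --build ../iree-build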
Fetch data.
./download_imagenette.sh # Imagenette 320px, ~300 MB preprocessed
Build a trainer.
lake build resnet34-train
Other available targets, in roughly book order: mnist-mlp-train, mnist-cnn-train, cifar-cnn-train, cifar-bn-train, resnet50-train, mobilenet-v2-train, mobilenet-v3-train, mobilenet-v4-train, efficientnet-train, efficientnet-v2-train, vgg-train, vit-tiny-train.
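To prebuild several trainers in one pass (target names taken from the list above):
for t in mnist-mlp-train mnist-cnn-train cifar-cnn-train resnet34-train; do
  lake build "$t"
done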
Run, setting the backend env vars. The FFI picks up IREE_BACKEND and IREE_CHIP at runtime.
NVIDIA / CUDA:
CUDA_VISIBLE_DEVICES=0 IREE_BACKEND=cuda IREE_CHIP=sm_86 \
  .lake/build/bin/resnet34-train
(sm_86 is consumer Ampere, i.e. the RTX 30-series. Use sm_80 for A100, sm_89 for Ada/RTX 40-series, and sm_90 for Hopper/H100.)
AMD / ROCm:
HIP_VISIBLE_DEVICES=0 IREE_BACKEND=rocm IREE_CHIP=gfx1100 \
  .lake/build/bin/resnet34-train
(gfx1100 is the 7900 XT/XTX. Use gfx90a for the MI200 series, gfx942 for MI300.)
CPU fallback:
IREE_BACKEND=llvm-cpu .lake/build/bin/resnet34-train
(No chip argument. Slower than GPU but useful for debugging trainers without a working GPU setup.)
First-run compile is slow. IREE spends 10–15 minutes compiling a ResNet-sized training step to a vmfb module. Subsequent runs reuse the cached vmfb under .lake/build/ unless you delete it. This first compile is a one-time cost per architecture, per GPU target.
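If you need to force a recompile, for example after switching GPU targets, deleting the cached artifacts is enough. The glob below is an assumption; the text above only promises that the cache lives under .lake/build/:
find .lake/build -name '*.vmfb' -delete   # next run recompiles from scratch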
Shell wrapper. The repo ships run.sh, which sets the right env vars and tees output to a log file. Usage: ./run.sh <trainer> [gpu] [backend] (e.g. ./run.sh resnet34; ./run.sh efficientnet-v2 1 cuda).
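The wrapper’s behavior can be pictured with a sketch like the following. This is not the shipped script, just a plausible shape for it; the defaults, the -train suffix handling, and the omission of chip selection are assumptions.
#!/bin/sh
# Hypothetical sketch of run.sh: set backend env vars, tee output to a log.
# (The real script presumably also sets IREE_CHIP for the selected GPU.)
TRAINER=${1:?usage: ./run.sh <trainer> [gpu] [backend]}
GPU=${2:-0}
BACKEND=${3:-cuda}
CUDA_VISIBLE_DEVICES=$GPU IREE_BACKEND=$BACKEND \
  .lake/build/bin/"$TRAINER"-train 2>&1 | tee "$TRAINER".log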
Track 3: Proofs only (no IREE needed)
If you just want to read the book and verify the proofs, you can skip IREE entirely. The proofs don’t depend on the runtime.
git clone https://github.com/brettkoonce/lean4-mlir.git
cd lean4-mlir
lake exe cache get # pull ~5 GB of precompiled Mathlib
lake build LeanMlir.Proofs.MLP LeanMlir.Proofs.CNN \
LeanMlir.Proofs.BatchNorm LeanMlir.Proofs.Residual \
LeanMlir.Proofs.Depthwise LeanMlir.Proofs.SE \
LeanMlir.Proofs.LayerNorm LeanMlir.Proofs.Attention
If the build succeeds, every theorem in the proof suite has type-checked against Lean’s kernel; that is the entire verification. Zero sorrys, zero project axioms, and every VJP formula is proved against Mathlib’s \(\operatorname {fderiv}\) (see Appendix B).
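For a feel of what “proved against Mathlib’s \(\operatorname {fderiv}\)” means, here is the simplest possible instance of the pattern. This is a standard Mathlib fact, not one of the book’s actual VJP theorems (those live under LeanMlir/Proofs/ and are more elaborate):
import Mathlib

-- The derivative of a continuous linear map is the map itself:
-- the degenerate case of a VJP formula agreeing with fderiv.
example {E F : Type*} [NormedAddCommGroup E] [NormedSpace ℝ E]
    [NormedAddCommGroup F] [NormedSpace ℝ F]
    (f : E →L[ℝ] F) (x : E) : fderiv ℝ f x = f :=
  f.fderiv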
To also run the numerical gradient checks:
python3 LeanMlir/Proofs/check_axioms.py
(The script’s name predates the axiom-count going to zero.) Every closed-form Jacobian formula in the suite has a finite-difference test that compares the formula against the numerical derivative. The check typically completes in under a minute and reports pass/fail per formula.
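Each test is the standard construction: perturb one input coordinate, take a symmetric difference quotient, and compare against the closed-form Jacobian entry. Schematically (the script’s exact step size and difference scheme are assumptions):
\[ \frac{\partial f_i}{\partial x_j}(x) \;\approx\; \frac{f_i(x + h\,e_j) - f_i(x - h\,e_j)}{2h}, \qquad h \text{ small}. \]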
Common troubleshooting
HAL device not found on launch — usually means the IREE runtime wasn’t built with the requested backend’s HAL driver. CPU fallback (IREE_BACKEND=llvm-cpu) is the safest diagnostic: if it works on CPU but not GPU, the problem is in your IREE build’s HAL drivers, not in the Lean code.
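If your IREE build includes the command-line tools (an assumption; a runtime-only build may omit them), you can ask IREE directly which HAL devices it can create:
iree-run-module --list_devices   # enumerates the HAL devices the build supports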
lake build fails with Mathlib cache errors — run lake exe cache get to populate the local cache. Without it, the first build compiles all of Mathlib from source (\(\sim \)45 min). With it, Mathlib oleans are downloaded directly (\(\sim \)30 seconds).
First IREE compile appears stuck — it isn’t. ResNet-34’s full training-step compile takes 10–15 minutes. Watch RAM usage; if it’s climbing steadily the compile is progressing. If you want a faster check, try the MNIST MLP first (compiles in \(\sim \)30 seconds).
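A concrete quick check, combining the MNIST target from Track 2 with the CPU backend:
lake build mnist-mlp-train
IREE_BACKEND=llvm-cpu .lake/build/bin/mnist-mlp-train   # step compiles in ~30 seconds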