Leanstral 1.5: Proof Engineering for the Masses
Mistral AI released Leanstral 1.5, an Apache 2.0-licensed model with 119B total parameters but only 6B active (via mixture-of-experts). It saturates the miniF2F benchmark (100% on both validation and test), solves 587/672 PutnamBench problems, and sets new state-of-the-art on FATE-H (87%) and FATE-X (34%). Beyond benchmarks, it verified 57 Rust repositories and uncovered 5 previously unreported bugs.
Training: Three-Stage Pipeline for Formal Reasoning
Leanstral 1.5 undergoes mid-training, supervised fine-tuning, and reinforcement learning with CISPO. Two RL environments are used:
- Multiturn environment: Given a theorem statement, the model submits a proof, receives Lean compiler feedback, and refines until success or budget exhaustion.
- Code agent environment: The model operates on a raw filesystem, editing files, running bash commands, and using the Lean language server to inspect goals and errors in real time. It can complete partial proofs, build auxiliary lemmas, and persist through multiple rounds of context compaction.
Final verification uses a fork of SafeVerify to ensure correctness against target theorems.
Benchmark Results: Cost-Effective State-of-the-Art
Leanstral 1.5 achieves 100% on miniF2F. On PutnamBench, it solves 587/672 problems at roughly $4 per problem, compared to Seed-Prover 1.5 high at an estimated $300+ per problem (10 H20-days per problem). On FATE-H/X, it reaches 87% and 34% respectively, outperforming Goedel-Architect and AxProverBase.
Test-time scaling is monotonic: Pass@8 on PutnamBench rises from 44 at 50k token budget to 244 at 200k, 493 at 1M, and 587 at 4M. The model keeps reasoning across millions of tokens, revising code and proofs without giving up.
Code Verification: AVL Trees and Bug Discovery
AVL Trees: Proving O(log n) Time Complexity
Leanstral 1.5 proved time complexity guarantees for a real AVL tree implementation. The proof required structural induction, monadic time tracking with the TimeM monad, and exhaustive case analysis for rebalancing paths. Over 2.7 million tokens and 22 compactions, it established an almost tight bound of 48 steps per height unit plus a constant for insertion, then connected height to tree size via a logarithmic relationship.
Bug Discovery Pipeline
An automated pipeline was built: Aeneas translates Rust code to Lean, Leanstral infers user intent and generates correctness properties. It attempts to prove each property (four attempts), and if all fail, tries to prove the negation (four attempts). Across 57 repositories, this flagged 47 violated properties, 11 pointing to genuine bugs, 5 previously unreported.
One bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode — an edge case that testing and fuzzing would typically miss.
Getting Started
Leanstral 1.5 is available on Hugging Face and as a free API endpoint (leanstral-1-5). Recommended usage via Mistral Vibe:
# Install Mistral Vibe
uv tool install mistral-vibe
uv tool update mistral-vibe
vibe --setup
# Install Leanstral 1.5 (via Vibe setup)
# Launch the agent
vibe --model leanstral-1-5
# Optional: Install Lean LSP MCP
# Add to ~/.vibe/config.toml:
[[mcp_servers]]
name = "lean-lsp"
transport = "stdio"
command = "uvx"
args = ["lean-lsp-mcp"]
tool_timeout_sec = 600
Then ask Leanstral to tackle a theorem, debug a proof, or contribute to a repository.
Why It Matters
Formal verification has been a niche skill requiring deep expertise and expensive compute. Leanstral 1.5 brings proof engineering to any developer with a free API key. Its ability to find real bugs in Rust code shows that formal methods are no longer just for academics — they can be part of daily development workflows. At $4 per Putnam problem, it's cheap enough to run at scale.





