Mistral Releases Leanstral 1.5: 6B Model Saturates miniF2F, Finds 5 Bugs

Mistral AI's Leanstral 1.5, a 6B active parameter model, achieves 100% on miniF2F, solves 587/672 PutnamBench problems, and uncovers 5 previously unknown bugs across 57 Rust repositories. Fully open-sourced under Apache 2.0, it marks a leap in accessible formal verification.

4 min read · Jul 4, 2026

Leanstral 1.5: Proof Engineering for the Masses

Mistral AI has released Leanstral 1.5, an Apache 2.0-licensed mixture-of-experts model with 119B total parameters, of which only 6B are active. It saturates the miniF2F benchmark (100% on both validation and test), solves 587/672 PutnamBench problems, and sets a new state of the art on FATE-H (87%) and FATE-X (34%). Beyond benchmarks, it verified 57 Rust repositories and uncovered 5 previously unreported bugs.

Training: Three-Stage Pipeline for Formal Reasoning

Leanstral 1.5 is trained in three stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. The RL stage uses two environments:

Multiturn environment: Given a theorem statement, the model submits a proof, receives Lean compiler feedback, and refines until success or budget exhaustion.
Code agent environment: The model operates on a raw filesystem, editing files, running bash commands, and using the Lean language server to inspect goals and errors in real time. It can complete partial proofs, build auxiliary lemmas, and persist through multiple rounds of context compaction.
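The multiturn environment is essentially a propose, compile, refine loop. The following is a minimal sketch of that loop, not Mistral's training code: `multiturn_prove`, `CompileResult`, and the toy `propose` and `compile_proof` stand-ins are all hypothetical names, and a real setup would call the model and the Lean compiler where the toy functions are.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompileResult:
    ok: bool
    feedback: str  # compiler errors or unsolved goals; empty on success


def multiturn_prove(
    theorem: str,
    propose: Callable[[str, list], str],
    compile_proof: Callable[[str, str], CompileResult],
    max_turns: int,
) -> Optional[str]:
    """Return the first proof that compiles, or None once the budget is spent."""
    history = []  # (attempted proof, compiler feedback) pairs from earlier turns
    for _ in range(max_turns):
        proof = propose(theorem, history)
        result = compile_proof(theorem, proof)
        if result.ok:
            return proof
        history.append((proof, result.feedback))
    return None


# Toy stand-ins (assumptions): a "model" that walks through fixed candidates
# and a "compiler" that accepts only one of them.
def toy_propose(theorem: str, history: list) -> str:
    candidates = ["sorry", "simp", "omega"]
    return candidates[min(len(history), len(candidates) - 1)]


def toy_compile(theorem: str, proof: str) -> CompileResult:
    if proof == "omega":
        return CompileResult(True, "")
    return CompileResult(False, f"tactic '{proof}' failed")


print(multiturn_prove("theorem t : 1 + 1 = 2", toy_propose, toy_compile, max_turns=5))
```

With a budget of five turns the toy loop succeeds on the third attempt; with a budget of two it returns `None`, which corresponds to the "budget exhaustion" outcome described above.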

Final verification uses a fork of SafeVerify to ensure correctness against target theorems.

Benchmark Results: Cost-Effective State-of-the-Art

Leanstral 1.5 achieves 100% on miniF2F. On PutnamBench, it solves 587/672 problems at roughly $4 per problem, compared to Seed-Prover 1.5 high at an estimated $300+ per problem (10 H20-days per problem). On FATE-H/X, it reaches 87% and 34% respectively, outperforming Goedel-Architect and AxProverBase.
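To put the cost figures side by side, here is a small back-of-the-envelope calculation. It uses only the numbers quoted above, treats "$300+" as its lower bound, and assumes the per-problem cost applies to every attempted problem, which the article does not state explicitly.

```python
PUTNAM_TOTAL = 672
LEANSTRAL_SOLVED = 587
LEANSTRAL_COST_PER_PROBLEM = 4.0      # USD, "roughly $4" per the article
SEED_PROVER_COST_PER_PROBLEM = 300.0  # USD, lower bound of the "$300+" estimate

solve_rate = LEANSTRAL_SOLVED / PUTNAM_TOTAL
cost_ratio = SEED_PROVER_COST_PER_PROBLEM / LEANSTRAL_COST_PER_PROBLEM

# Assumption: cost is paid per attempted problem, solved or not.
leanstral_full_run = LEANSTRAL_COST_PER_PROBLEM * PUTNAM_TOTAL
seed_prover_full_run = SEED_PROVER_COST_PER_PROBLEM * PUTNAM_TOTAL

print(f"solve rate:          {solve_rate:.1%}")
print(f"cost ratio:          {cost_ratio:.0f}x")
print(f"full run, Leanstral: ${leanstral_full_run:,.0f}")
print(f"full run, Seed-Prover (lower bound): ${seed_prover_full_run:,.0f}")
```

Under these assumptions a full PutnamBench pass costs a few thousand dollars rather than a few hundred thousand.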

Test-time scaling is monotonic: the number of PutnamBench problems solved at pass@8 rises from 44 at a 50k-token budget to 244 at 200k, 493 at 1M, and 587 at 4M. The model keeps reasoning across millions of tokens, revising code and proofs without giving up.
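The scaling figures are easier to compare as fractions of the 672-problem benchmark. The short script below converts the quoted counts to solve rates, checks that they never decrease, and computes the gain from each budget step; variable names are illustrative.

```python
PUTNAM_TOTAL = 672

# Token budget -> problems solved at pass@8, as quoted in the article.
solved_by_budget = {50_000: 44, 200_000: 244, 1_000_000: 493, 4_000_000: 587}


def is_monotonic(counts):
    """True if each count is at least as large as the one before it."""
    return all(a <= b for a, b in zip(counts, counts[1:]))


budgets = sorted(solved_by_budget)
counts = [solved_by_budget[b] for b in budgets]
rates = {b: solved_by_budget[b] / PUTNAM_TOTAL for b in budgets}
gains = [b - a for a, b in zip(counts, counts[1:])]

for budget in budgets:
    print(f"{budget:>9,} tokens: {solved_by_budget[budget]:>3} solved ({rates[budget]:.1%})")
print("monotonic:", is_monotonic(counts))
print("gain per budget step:", gains)
```

The gains per step are 200, 249, and 94 problems, so the curve keeps rising but flattens at the largest budget.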

Code Verification: AVL Trees and Bug Discovery

AVL Trees: Proving O(log n) Time Complexity

Leanstral 1.5 proved time complexity guarantees for a real AVL tree implementation. The proof required structural induction, monadic time tracking with the TimeM monad, and exhaustive case analysis for rebalancing paths. Over 2.7 million tokens and 22 compactions, it established an almost tight bound of 48 steps per height unit plus a constant for insertion, then connected height to tree size via a logarithmic relationship.
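The step from a per-height bound to O(log n) can be sketched with the textbook AVL height argument. This is an illustration only: the article does not give the constant, the height convention, or the exact logarithmic relation used in Leanstral's proof, so the symbols c, N(h), and the convention that an empty tree has height 0 are assumptions.

```latex
% Insertion cost in TimeM steps, as quoted in the article (c is an unspecified constant):
T_{\mathrm{ins}}(t) \le 48\,h(t) + c

% Textbook fact: N(h), the fewest nodes in an AVL tree of height h, satisfies
N(0) = 0,\quad N(1) = 1,\quad N(h) = N(h-1) + N(h-2) + 1
\;\Longrightarrow\; N(h) + 1 = F_{h+2} \ge \varphi^{h},
\qquad \varphi = \tfrac{1+\sqrt{5}}{2}

% Hence, for a tree t with n nodes:
h(t) \le \log_{\varphi}(n+1) \approx 1.44\,\log_2(n+1)

% Combining the two bounds:
T_{\mathrm{ins}}(t) \le 48\,\log_{\varphi}(n+1) + c \approx 69.1\,\log_2(n+1) + c = O(\log n)
```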

Bug Discovery Pipeline

The automated pipeline works as follows: Aeneas translates Rust code to Lean, then Leanstral infers user intent and generates correctness properties. For each property, it makes four attempts at a proof and, if all of them fail, four attempts at proving the negation. Across 57 repositories, this flagged 47 violated properties, of which 11 pointed to genuine bugs and 5 of those were previously unreported.
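The prove-then-refute step can be sketched as a small decision procedure. This is a hedged illustration of the control flow described above, not the actual pipeline: `check_property`, `Verdict`, and the toy prover are hypothetical, and a real `try_prove` would ask Leanstral for a proof and check it with Lean.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    PROVED = "proved"
    VIOLATED = "violated"  # the negation was proved, so the property is false
    UNKNOWN = "unknown"    # neither direction succeeded within the budget


def check_property(prop: str, try_prove: Callable[[str], bool], attempts: int = 4) -> Verdict:
    """Try to prove prop; if every attempt fails, try to prove its negation."""
    if any(try_prove(prop) for _ in range(attempts)):
        return Verdict.PROVED
    negation = f"¬ ({prop})"
    if any(try_prove(negation) for _ in range(attempts)):
        return Verdict.VIOLATED
    return Verdict.UNKNOWN


def make_toy_prover(provable):
    """Toy stand-in (assumption): succeeds only on statements in `provable`."""
    calls = []

    def try_prove(statement: str) -> bool:
        calls.append(statement)
        return statement in provable

    return try_prove, calls


prover, calls = make_toy_prover({"¬ (decode (encode x) = x)"})
print(check_property("decode (encode x) = x", prover), "after", len(calls), "prover calls")
```

Only `VIOLATED` verdicts become bug candidates, and they still need triage: in the reported run, 11 of the 47 flagged properties (about 23%) turned out to be genuine bugs.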

One bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode — an edge case that testing and fuzzing would typically miss.
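To see why this input is special, the sketch below simulates 64-bit unsigned arithmetic in Python. It is not the library's code: `unsign_naive_wrapping` is a hypothetical decoder that computes (value + 1) the way a release build would (wrapping around), and `unsign_safe` is the standard shift-and-xor zigzag decode, which never increments the input.

```python
U64_MAX = 2**64 - 1


def unsign_naive_wrapping(value: int) -> int:
    """Zigzag decode that computes (value + 1) with wrapping u64 arithmetic."""
    if value & 1:
        bumped = (value + 1) & U64_MAX  # wraps to 0 when value == U64_MAX
        return -(bumped // 2)
    return value // 2


def unsign_safe(value: int) -> int:
    """Zigzag decode without the increment: (value >> 1) ^ -(value & 1)."""
    return (value >> 1) ^ -(value & 1)


# The two decoders agree on ordinary inputs...
for v in range(6):
    print(v, unsign_naive_wrapping(v), unsign_safe(v))

# ...and disagree only at the very top of the u64 range.
print(unsign_naive_wrapping(U64_MAX), unsign_safe(U64_MAX))
```

In Rust, a debug build panics on the overflowing addition instead of wrapping, which matches the crash described above, while a release build wraps by default and produces the wrong value silently.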

Getting Started

Leanstral 1.5 is available on Hugging Face and as a free API endpoint (leanstral-1-5). Recommended usage via Mistral Vibe:

# Install Mistral Vibe
uv tool install mistral-vibe
uv tool upgrade mistral-vibe
vibe --setup

# Install Leanstral 1.5 (via Vibe setup)
# Launch the agent
vibe --model leanstral-1-5

# Optional: Install Lean LSP MCP
# Add to ~/.vibe/config.toml:
[[mcp_servers]]
name = "lean-lsp"
transport = "stdio"
command = "uvx"
args = ["lean-lsp-mcp"]
tool_timeout_sec = 600

Then ask Leanstral to tackle a theorem, debug a proof, or contribute to a repository.

Why It Matters

Formal verification has been a niche skill requiring deep expertise and expensive compute. Leanstral 1.5 brings proof engineering to any developer with a free API key. Its ability to find real bugs in Rust code shows that formal methods are no longer just for academics — they can be part of daily development workflows. At $4 per Putnam problem, it's cheap enough to run at scale.

Editor's Take

I've been skeptical of AI for formal verification, but Leanstral 1.5's bug discovery pipeline is genuinely useful. Finding 5 unreported bugs in real Rust repos, including an overflow edge case, is more than a benchmark win. My concern is that users still need Lean expertise to interpret its proofs, but the free API and open weights lower the barrier. I'd love to see it integrated directly into CI pipelines.

— DevDigest Editorial

Key Takeaways

• Use Leanstral 1.5 via the free API to verify code correctness in Lean 4, especially for safety-critical Rust code translated via Aeneas.
• Deploy the bug discovery pipeline on your own repos: translate Rust to Lean, then let Leanstral infer properties and attempt proofs.
• Leverage test-time scaling: increase the token budget for harder proofs, since Leanstral solves more problems as it is given more compute.

Why It Matters

Leanstral 1.5 makes formal verification accessible and practical for everyday development. It finds bugs that testing misses, verifies complex code properties, and costs a fraction of previous tools. Any developer can now use it to prove correctness of their code.

#open-source #ai-models #formal-verification #lean #rust
