3B Parameters, Frontier-Level Reasoning
VibeThinker-3B scores 94.3 on AIME26 (97.1 with test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6. These numbers place it in the same performance band as DeepSeek V3.2, GLM-5, and Gemini 3 Pro, models that are orders of magnitude larger. The 3B dense model also achieves 93.4 on IFEval, which suggests that pushing reasoning this hard did not come at the cost of instruction following.
The Spectrum-to-Signal Post-Training Paradigm
The key innovation is a three-stage pipeline:
- Curriculum-based Supervised Fine-Tuning (SFT): Start with simpler reasoning tasks, gradually increase difficulty. This builds a strong foundation before reinforcement learning.
- Multi-domain Reinforcement Learning (RL): Apply GRPO (Group Relative Policy Optimization) across diverse verifiable domains: math, code, and logic puzzles. The model learns to generate correct reasoning chains in each domain.
- Offline Self-Distillation: Use the best-performing checkpoints to generate synthetic training data, then retrain the model on that data. This compresses the reasoning knowledge without requiring additional human labels.
All of this builds on the Spectrum-to-Signal framework introduced in the authors' earlier 1.5B work, scaled up to 3B with a more heavily optimized pipeline.
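The multi-domain RL stage relies on GRPO, whose core idea is to score each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that advantage computation. It assumes binary verifier rewards and a population standard deviation; the function name `group_relative_advantages` and the `eps` value are illustrative and not taken from the VibeThinker code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt; a verifier marked the first and last correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers receive positive advantages, incorrect ones negative.
print(advantages)
```

When every answer in a group is right (or every one is wrong), all advantages are zero, so that prompt contributes no learning signal. This is one reason difficulty-aware data selection can matter for this kind of training.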
Benchmark Breakdown
| Benchmark | Score | Notes |
|---|---|---|
| AIME26 | 94.3 (97.1 with test-time scaling) | Reportedly outperforms Opus 4.5 by 5+ points |
| LiveCodeBench v6 | 80.2 Pass@1 | Matches DeepSeek V3.2 |
| LeetCode unseen contests | 96.1% acceptance rate | Strong out-of-distribution generalization |
| IFEval | 93.4 | Strict instruction following preserved |
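For readers unfamiliar with the Pass@1 metric in the table, here is a minimal sketch of the widely used unbiased pass@k estimator for code benchmarks. The post does not say how many samples per problem were drawn or whether this exact estimator was used, so the sample counts below are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples per problem, of which c passed all tests."""
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples for one problem, 8 of them pass.
print(pass_at_k(n=10, c=8, k=1))
```

For k=1 the estimator reduces to the fraction of passing samples, and the benchmark score is this value averaged over all problems.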
Parametric Compression-Coverage Hypothesis
The authors propose that verifiable reasoning can be compressed into compact "reasoning cores," while open-domain knowledge requires broad parameter coverage. If the hypothesis holds, it would explain why a 3B model can match 100B+ models on reasoning tasks: reasoning logic is compressible, but knowing facts about the world requires many parameters.
Practical Implications for Developers
If you're building a code assistant or math solver, you can now run frontier-level reasoning on a single GPU. VibeThinker-3B takes about 6 GB for its weights in FP16 (3 billion parameters at 2 bytes each), with the KV cache and activations adding to that at inference time. That means local inference without cloud dependencies. The model is available on Hugging Face (check the arXiv link for details).
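As a rough sanity check on the memory figure, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. This sketch assumes exactly 3.0 billion parameters (the real count may differ) and ignores the KV cache, activations, and framework overhead; `weight_memory_gb` is a helper made up for this example.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 3e9  # assumption: a round 3B, not the exact published count

for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
```

Long reasoning traces grow the KV cache, so it is worth leaving headroom beyond the weight size when choosing a GPU.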
How to Use It
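```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID as given in this post; confirm it on Hugging Face before running.
model_id = "vibethinker/VibeThinker-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

prompt = "Solve: If f(x) = x^2 + 2x + 1, find f(3)."
# Keep the input tensors on the same device as the model weights.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```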
This is a causal LM, so you can use standard generation APIs. For best results, use claim-level test-time scaling: generate multiple candidate answers, then have the model verify each one and pick the most consistent.
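The post does not spell out how claim-level verification works, so the sketch below swaps in a simpler, well-known relative: plain self-consistency, which samples several answers and keeps the most frequent one. The `sample_answer` callable is a hypothetical stand-in for a model call that returns only the final answer string; the canned answers exist just to make the example runnable without a GPU.

```python
from collections import Counter


def majority_vote(candidates):
    """Return the most frequent answer and the share of samples that agree."""
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / len(candidates)


def self_consistent_answer(sample_answer, prompt, n=8):
    """Draw n answers for the same prompt and keep the majority answer."""
    candidates = [sample_answer(prompt) for _ in range(n)]
    return majority_vote(candidates)


# Hypothetical stand-in for sampling from the model with temperature > 0.
canned = iter(["16", "15", "16", "16"])
answer, share = self_consistent_answer(lambda prompt: next(canned), "Find f(3).", n=4)
print(answer, share)
```

In a real setup, `sample_answer` would call `model.generate` with sampling enabled and parse the final answer out of the reasoning trace. A low agreement share is a useful signal that the problem deserves more samples or a verification pass.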
The Verdict
VibeThinker-3B is not just a smaller version of a larger model. It's a demonstration that reasoning ability can be decoupled from parameter count. If the Parametric Compression-Coverage Hypothesis holds, we'll see more compact models that excel at specific reasoning tasks while larger models handle broad knowledge.
For developers, this means you can deploy high-quality reasoning without massive infrastructure. Try it on your own benchmarks — especially if you're building tools for math, code generation, or formal verification.



