VibeThinker-3B: 3B Model Beats Opus 4.5, DeepSeek V3.2 on Reasoning

VibeThinker-3B, a 3B parameter dense model, achieves 94.3 on AIME26 and 80.2 on LiveCodeBench v6, matching or exceeding models like DeepSeek V3.2 and Gemini 3 Pro. It uses curriculum SFT, multi-domain RL, and offline self-distillation within the Spectrum-to-Signal post-training paradigm.

3 min read · Jun 23, 2026

3B Parameters, Frontier-Level Reasoning

VibeThinker-3B scores 94.3 on AIME26 (97.1 with test-time scaling) and 80.2 Pass@1 on LiveCodeBench v6. These numbers place it in the same performance band as DeepSeek V3.2, GLM-5, and Gemini 3 Pro, models that are orders of magnitude larger. The 3B dense model also scores 93.4 on IFEval, which suggests that heavy reasoning training has not degraded instruction following.

The Spectrum-to-Signal Post-Training Paradigm

The key innovation is a three-stage pipeline:

Curriculum-based Supervised Fine-Tuning (SFT): Start with simpler reasoning tasks, gradually increase difficulty. This builds a strong foundation before reinforcement learning.
Multi-domain Reinforcement Learning (RL): Apply GRPO (Group Relative Policy Optimization) across diverse verifiable domains — math, code, logic puzzles. The model learns to generate correct reasoning chains for each domain.
Offline Self-Distillation: Use the best-performing checkpoints to generate synthetic training data, then retrain the model on that data. This compresses the reasoning knowledge without requiring additional human labels.

All of this builds on the Spectrum-to-Signal framework introduced in the authors' earlier 1.5B work, scaled up to 3B with a more optimized pipeline.
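To make the RL stage more concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name: several responses are sampled for one prompt, each is scored by a verifier, and each reward is normalized against the group's mean and spread. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; the article does not describe the authors' exact implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    rewards: verifier scores for several responses to the same prompt.
    eps: small constant (an assumption) to avoid dividing by zero when
         every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt, scored 1.0 if verified correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, so no separate value network is needed to provide a baseline. When every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.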

Benchmark Breakdown

Benchmark	Score	Notes
AIME26	94.3 (97.1 with test-time scaling)	Reportedly outperforms Opus 4.5 by 5+ points
LiveCodeBench v6	80.2 Pass@1	Matches DeepSeek V3.2
LeetCode unseen contests	96.1% acceptance rate	Strong out-of-distribution generalization
IFEval	93.4	Strict instruction following preserved
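
For readers unfamiliar with the Pass@1 metric in the table, the sketch below shows the commonly used unbiased pass@k estimator for code benchmarks. The sample counts are made up for illustration; the article does not say how many samples per problem were drawn for the LiveCodeBench score.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples passes.

    n: total samples generated for a problem.
    c: number of those samples that passed all tests.
    k: sample budget being evaluated (k=1 for Pass@1).
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples drawn, 8 passed the hidden tests.
print(pass_at_k(n=10, c=8, k=1))
```

With k=1 the estimator reduces to the fraction of samples that pass, and the benchmark score is the average of that value over all problems.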

Parametric Compression-Coverage Hypothesis

The authors propose that verifiable reasoning can be compressed into compact "reasoning cores," while open-domain knowledge requires broad parameter coverage. If the hypothesis holds, it would explain why a 3B model can match 100B+ models on reasoning tasks: reasoning logic is compressible, but knowing facts about the world requires many parameters.

Practical Implications for Developers

If you're building a code assistant or math solver, you can now run frontier-level reasoning on a single GPU. VibeThinker-3B fits in less than 6GB of memory (FP16). That means local inference without cloud dependencies. The model is available on Hugging Face (check the arXiv link for details).
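
The memory figure can be sanity-checked with simple arithmetic: parameter count times bytes per parameter. The sketch below assumes exactly 3 billion parameters (the article only says "3B") and counts weights only; the KV cache and activations add to this during generation.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumed parameter count; the true figure may differ slightly.
params = 3e9
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {weight_memory_gib(params, nbytes):.2f} GiB")
```

At 2 bytes per parameter the weights come to roughly 5.6 GiB, which is consistent with the "less than 6GB" claim, though long reasoning traces will need extra headroom for the KV cache.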

How to Use It

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model ID as given in the article; FP16 halves weight memory relative to FP32.
model_id = "vibethinker/VibeThinker-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Solve: If f(x) = x^2 + 2x + 1, find f(3)."
# Keep the input tensors on the same device as the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Skip special tokens so the printed text has no BOS/EOS markers.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This is a causal LM, so you can use standard generation APIs. For best results, use claim-level test-time scaling: generate multiple candidate answers, then have the model verify each one and pick the most consistent.
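
The selection step can be sketched as follows. This is a simplified verify-then-vote scheme, not the authors' claim-level procedure, which the article does not spell out. The `select_answer` function and the `verify` callable are hypothetical names; in practice the candidates would come from sampled generations and `verify` would prompt the model to check each answer.

```python
from collections import Counter


def select_answer(candidates, verify):
    """Pick the most frequent answer among those the verifier accepts.

    candidates: final answers extracted from several sampled generations.
    verify: callable returning True when the checker accepts an answer.
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    verified = [a for a in candidates if verify(a)]
    # If the verifier rejects everything, fall back to a plain majority vote.
    pool = verified or candidates
    return Counter(pool).most_common(1)[0][0]


# Stand-in data for the prompt above; f(3) = 3^2 + 2*3 + 1.
candidates = ["16", "16", "15", "16", "9"]
best = select_answer(candidates, verify=lambda a: int(a) == 3**2 + 2 * 3 + 1)
print(best)
```

The trade-off is cost: sampling N candidates and verifying each one multiplies inference compute, which is why the article reports the scaled score (97.1) separately from the single-pass score (94.3).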

The Verdict

VibeThinker-3B is not just a smaller version of a larger model. It's a demonstration that reasoning ability can be decoupled from parameter count. If the Parametric Compression-Coverage Hypothesis holds, we'll see more compact models that excel at specific reasoning tasks while larger models handle broad knowledge.

For developers, this means you can deploy high-quality reasoning without massive infrastructure. Try it on your own benchmarks — especially if you're building tools for math, code generation, or formal verification.

Editor's Take

I've been skeptical of small models claiming to beat giants, but the AIME26 and LiveCodeBench numbers are hard to dismiss. I tried VibeThinker-3B on a few Project Euler problems and it solved them correctly, something GPT-4o often fails at. The Parametric Compression-Coverage Hypothesis makes intuitive sense — reasoning is a skill, not a fact database. I think we'll see a split: compact reasoning models for specific tasks, and giant models for general knowledge. If you're a solo dev, this is the kind of model you can actually run locally.

— DevDigest Editorial

Key Takeaways

• Download VibeThinker-3B from Hugging Face and test it on your own reasoning benchmarks.
• Use claim-level test-time scaling to boost accuracy: generate multiple answers and have the model verify them.
• Consider this model for local code generation or math solving tasks where latency and cost matter.

Why It Matters

This shows you can get frontier-level reasoning from a 3B model that runs on a single GPU. If you're building coding assistants or math solvers, you no longer need massive models. It also challenges the assumption that bigger is always better for reasoning tasks.
