vLLM Semantic Router: Micro-Agents Inside the Model API

vLLM just shipped Semantic Router, and it changes what a "model API" means. Instead of forwarding each request to a single backend, the router can orchestrate multiple models behind one stable endpoint. The user calls vllm-sr/auto, and the router decides the collaboration pattern.

The reported numbers are concrete. On GPQA-Diamond, the router scores 96.0% (VSR Closed), beating Fugu Ultra (95.5%) and GPT-5.5 (93.6%). On LiveCodeBench (Jan-Apr 2025), it hits 92.6%, ahead of GPT-5.5 (90.7%) and Opus 4.8 (90.3%). On Humanity's Last Exam, it matches Fugu Ultra at 50.0%. These aren't synthetic benchmarks; they cover hard reasoning, coding, and long-form tasks.

The Router Is Not a Load Balancer

Older routers just picked a model and forwarded the request. This one executes algorithms. The core primitive is the looper, a bounded runtime for micro-agents. Five looper patterns are documented:

  1. Confidence: sequential escalation. Try a cheap model, measure logprob confidence, escalate if too low. Tunable thresholds and failure policies.
  2. Ratings: parallel fan-out under a hard concurrency cap. Aggregates with rating-aware weights.
  3. ReMoM: repeated mixture-of-model reasoning. Fan out breadth samples, wait for quorum, synthesize final answer.
  4. Fusion: panel-judge-final. Independent responses become evidence for a judge and finalizer. Disagreement is signal, not noise.
  5. Workflows: micro-agent runtime with planner, bounded steps, and output contract enforcement.
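
To make the first pattern concrete, here is a minimal Python sketch of confidence-based escalation. It is not the router's implementation: the model interface, the stub models, the default threshold, and the choice of mean token probability as the confidence measure are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical model interface: takes a prompt, returns (text, per-token logprobs).
ModelFn = Callable[[str], Tuple[str, List[float]]]


def mean_token_probability(logprobs: List[float]) -> float:
    """Geometric-mean token probability: exp of the average logprob."""
    if not logprobs:
        return 0.0
    return math.exp(sum(logprobs) / len(logprobs))


def escalate(
    prompt: str,
    ladder: List[Tuple[str, ModelFn]],
    threshold: float = 0.8,  # assumed default, tunable per recipe
) -> Tuple[str, str, float]:
    """Try models cheapest-first; stop at the first answer that clears the threshold.

    Failure policy assumed here: if no model clears the threshold,
    return the answer from the last model on the ladder anyway.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    result = None
    for name, model in ladder:
        text, logprobs = model(prompt)
        confidence = mean_token_probability(logprobs)
        result = (name, text, confidence)
        if confidence >= threshold:
            break
    return result


# Stub models standing in for real backends (illustrative values only).
def small_model(prompt: str) -> Tuple[str, List[float]]:
    return ("maybe 42", [-1.2, -0.9, -1.5])


def large_model(prompt: str) -> Tuple[str, List[float]]:
    return ("42", [-0.05, -0.02])
```

Calling escalate("q", [("small", small_model), ("large", large_model)]) falls through to the large model, because the small model's mean token probability is well under the 0.8 threshold. Lowering the threshold keeps more traffic on the cheap model.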

Each looper has explicit budget, topology, trace, and failure policy. This is infrastructure, not app glue.
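
As a rough illustration of what "explicit budget, trace, and failure policy" can look like, the sketch below implements a Ratings-style fan-out in Python with asyncio. The function name, the backend interface, the rating values, and the skip-on-failure policy are hypothetical; the source only states that the pattern runs in parallel under a hard concurrency cap and aggregates with rating-aware weights.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical backend interface: an async function from prompt to answer.
AsyncModel = Callable[[str], Awaitable[str]]


async def ratings_loop(
    prompt: str,
    models: Dict[str, AsyncModel],
    ratings: Dict[str, float],
    max_concurrency: int = 2,  # budget: hard cap on in-flight calls
) -> Tuple[str, List[str]]:
    """Fan out to all models, then pick the answer with the highest rating-weighted vote."""
    semaphore = asyncio.Semaphore(max_concurrency)
    trace: List[str] = []

    async def call(name: str):
        async with semaphore:
            try:
                answer = await models[name](prompt)
            except Exception as exc:
                # Failure policy assumed here: record the failure and skip the model.
                trace.append(f"{name}: failed ({exc})")
                return None
            trace.append(f"{name}: ok")
            return name, answer

    results = await asyncio.gather(*(call(name) for name in models))

    # Each successful model votes for its answer with its rating as the weight.
    scores: Dict[str, float] = {}
    for item in results:
        if item is None:
            continue
        name, answer = item
        scores[answer] = scores.get(answer, 0.0) + ratings.get(name, 1.0)
    if not scores:
        raise RuntimeError("every model in the fan-out failed")
    return max(scores, key=scores.get), trace


# Stub backends for demonstration only.
async def model_a(prompt: str) -> str:
    return "B"


async def model_b(prompt: str) -> str:
    return "C"


async def model_c(prompt: str) -> str:
    return "C"


async def model_down(prompt: str) -> str:
    raise TimeoutError("backend timeout")
```

Running it with asyncio.run(ratings_loop(...)) returns both the winning answer and the trace, so an operator can see which backends contributed and which were skipped.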

Code Example: One Model Name, Many Loops

{
  "model": "vllm-sr/auto",
  "messages": [{"role": "user", "content": "Explain quantum entanglement to a 10-year-old."}]
}

That single call triggers signal extraction, task-shape projection, risk band matching, and algorithm selection. The response is an ordinary OpenAI-compatible JSON object, so no client changes are needed.
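
For readers who want to send that request from code, here is a small Python sketch using only the standard library. The router address (localhost:8000) is an assumption, and the /v1/chat/completions path and the choices[0].message.content response shape follow the usual OpenAI chat-completions convention rather than anything stated in the announcement; adjust them to match your deployment.

```python
import json
import urllib.request

# Assumed router address and OpenAI-style path; change for your deployment.
ROUTER_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, url: str = ROUTER_URL) -> urllib.request.Request:
    """Build the POST request; the only router-specific part is the model name."""
    payload = {
        "model": "vllm-sr/auto",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, url: str = ROUTER_URL) -> str:
    """Send the request and return the assistant text from the first choice."""
    request = build_chat_request(prompt, url)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A call such as ask("Explain quantum entanglement to a 10-year-old.") would then return plain text, with the choice of looper hidden behind the endpoint.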

Why This Matters for Production

The router already owns model aliases, provider credentials, cost metadata, retries, timeouts, and traces. Adding collaboration logic in the same place means operators control the recipe. The application never sees the complexity, and the system can improve without client integration changes.

For SWE-style tasks, Workflows can express a planner, patcher, verifier, and finalizer without requiring the application to own a bespoke agent stack. The loop is powerful but governed by infrastructure.
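
The planner, patcher, verifier, and finalizer roles can be sketched as a bounded loop. The Python below is a toy illustration, not the router's Workflows runtime: every function name, the WorkflowResult type, the step budget, and the "non-empty string" output contract are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class WorkflowResult:
    output: str
    verified: bool
    steps_used: int
    trace: List[str] = field(default_factory=list)


def run_workflow(
    task: str,
    planner: Callable[[str], str],
    patcher: Callable[[str, str], str],           # (plan, feedback) -> patch
    verifier: Callable[[str], Tuple[bool, str]],  # patch -> (ok, feedback)
    finalizer: Callable[[str], str],
    max_steps: int = 3,  # budget: hard bound on patch/verify rounds
) -> WorkflowResult:
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    plan = planner(task)
    trace = [f"plan: {plan}"]
    feedback, patch, ok, steps = "", "", False, 0
    for steps in range(1, max_steps + 1):
        patch = patcher(plan, feedback)
        ok, feedback = verifier(patch)
        trace.append(f"step {steps}: verified={ok}")
        if ok:
            break
    output = finalizer(patch)
    # Output contract assumed here: the final answer must be a non-empty string.
    if not isinstance(output, str) or not output.strip():
        raise ValueError("finalizer violated the output contract")
    # If the budget ran out, the last attempt is returned but marked unverified.
    return WorkflowResult(output, ok, steps, trace)


# Stub roles for demonstration; real ones would call models or run tests.
def demo_planner(task: str) -> str:
    return f"fix {task}"


def demo_patcher(plan: str, feedback: str) -> str:
    return "return a + b" if feedback else "return a - b"


def demo_verifier(patch: str) -> Tuple[bool, str]:
    ok = "a + b" in patch
    return ok, ("" if ok else "test_add failed")


def demo_finalizer(patch: str) -> str:
    return patch.strip()
```

With these stubs the first patch fails verification, the feedback drives a second attempt, and the loop stops after two steps. Setting max_steps=1 returns the unverified first attempt instead of looping indefinitely, which is the point of keeping the budget in infrastructure.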

The Takeaway

vLLM Semantic Router is open-source and integrates with the existing vLLM serving stack. If you're running production inference, this is the next step: make your model API a system boundary, not just a checkpoint call. Start with the vllm-sr/auto model name, tune the recipes, and let the router handle collaboration.

Check the vLLM blog for setup instructions and recipe configuration.