The 12% Gap That Isn't

Qwen 3.6 27B scores 77.2% on SWE-Bench Verified. Claude Opus 4.8 scores 88.6%. That is a gap of 11.4 percentage points, which many developers round up and shout "local is only 12% behind SOTA!" But Alex Ellis, founder of OpenFaaS, has a different take: benchmarks don't tell the full story.

Ellis runs a small software business building infrastructure tools like OpenFaaS, SlicerVM, and Actuated. He's been using AI since tab-completion in VS Code. Today, Claude or Codex does most of his coding, and he rarely writes code by hand. But he also runs Qwen 3.6 27B locally on an RTX 6000 Pro. His conclusion: local models are a different tool, not a worse one.

Where Local Models Shine

Ellis identifies three concrete advantages:

  1. Fixed costs: Cloud coding plans cost ~$200/month per developer. At Uber, where spend is capped at $1500/month/developer/tool, heavy agentic use across several tools can eat around 12% of a developer's $330k salary (a single tool at the cap comes to about 5.5%). Local models eliminate variable token costs.

  2. Privacy and sovereignty: Ellis's products (OpenFaaS, Inlets, SlicerVM) all run on customer infrastructure. Sending code to Anthropic or OpenAI isn't an option for enterprise clients with strict data controls.

  3. Vendor independence: When Anthropic removed the Fable 5 model overnight, non-US developers felt the pain. Local models insulate against the question "What if the frontier labs do X?"
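
The salary figure in the first item can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the list ($1500 monthly cap per tool, $330k salary, $200 monthly plan). The function name and the idea of varying the tool count are illustrative assumptions, not something taken from Ellis's post.

```python
def annual_share_of_salary(monthly_spend: float, salary: float, tools: int = 1) -> float:
    """Return yearly AI tool spend as a fraction of annual salary."""
    return monthly_spend * 12 * tools / salary


# Figures quoted in the list above.
CAP_PER_TOOL = 1500   # USD per month, per developer, per tool
SALARY = 330_000      # USD per year
FLAT_PLAN = 200       # USD per month

# The tool counts (1 and 2) are an assumption made for illustration.
for tools in (1, 2):
    share = annual_share_of_salary(CAP_PER_TOOL, SALARY, tools)
    print(f"{tools} tool(s) at the cap: {share:.1%} of salary")

print(f"flat plan: {annual_share_of_salary(FLAT_PLAN, SALARY):.1%} of salary")
```

Under these numbers, one tool at the cap is about 5.5% of salary and two are about 10.9%, so a share near 12% implies capped spend on more than one tool. A flat $200 plan is under 1%.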

The Infinite Loop Problem

Ellis's key technical complaint: Qwen 3.6 27B in Q4 quantization with 200K context can do small guided tasks, but fails on open-ended exploration. He gave it a simple instruction: "Explore this machine from every angle, complete a forensic report." Qwen read every file on his machine, filled its context, then hallucinated filenames and tool calls.

He compares it to tempering a blade: "The model is running so hot, it shoots past the goal and starts looping. Nothing can fix it other than clearing the context." He'd never leave Qwen working unattended on a long-horizon task.
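
One way an agent harness could limit the damage from this failure mode is to watch for repeated tool calls and a nearly full context, then stop and ask for a reset. The sketch below is hypothetical: `LoopGuard`, its thresholds, and the token counts are illustrative assumptions, not part of opencode or of Ellis's setup.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    """Hypothetical guard that flags runaway agent sessions."""
    context_limit: int = 200_000   # tokens; mirrors the 200K context in the text
    max_fill: float = 0.9          # assumed threshold: stop at 90% of the context
    max_repeats: int = 3           # assumed threshold: identical calls tolerated
    used_tokens: int = 0
    seen: Counter = field(default_factory=Counter)

    def record(self, tool: str, args: str, tokens: int) -> str:
        """Record one tool call and return 'ok', 'looping', or 'context_full'."""
        self.used_tokens += tokens
        self.seen[(tool, args)] += 1
        if self.seen[(tool, args)] > self.max_repeats:
            return "looping"
        if self.used_tokens >= self.context_limit * self.max_fill:
            return "context_full"
        return "ok"


# Simulate a model that keeps issuing the same call.
guard = LoopGuard()
for _ in range(5):
    status = guard.record("read_file", "/etc/hosts", tokens=1_000)
print(status)  # 'looping': the same call was seen more than max_repeats times
```

A guard like this does not fix the model. It only turns a silent loop into an explicit signal to clear the context, which matches Ellis's observation that nothing else helps.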

The Cost Reality Check

Ellis argues that cloud coding plans are subsidized, and points to GitHub Copilot's shift from flat-rate to token-based pricing as evidence that flat plans hide the true cost of API calls. For a small team managing multiple products, $200/month/developer adds up. Ellis says his RTX 6000 Pro paid for itself in 2-3 months.
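
A payback claim like this comes down to a simple break-even formula: hardware cost divided by the monthly cloud spend it displaces. The sketch below is a rough illustration. The GPU price is a placeholder, not a figure from the text, and the $3000 spend level is an assumed value alongside the $200 and $1500 figures quoted earlier.

```python
import math


def break_even_months(hardware_cost: float, monthly_cloud_spend: float,
                      monthly_running_cost: float = 0.0) -> float:
    """Months until cumulative cloud savings cover the hardware purchase."""
    monthly_saving = monthly_cloud_spend - monthly_running_cost
    if monthly_saving <= 0:
        return math.inf  # the hardware never pays for itself
    return hardware_cost / monthly_saving


HYPOTHETICAL_GPU_PRICE = 9_000  # USD; placeholder, not a figure from the text

for spend in (200, 1_500, 3_000):
    months = break_even_months(HYPOTHETICAL_GPU_PRICE, spend)
    print(f"${spend}/month displaced -> {months:.1f} months to break even")
```

The result is only as good as the inputs: electricity, the rest of the workstation, and how much cloud usage the card really replaces all shift the answer.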

But cost isn't the only factor. Ellis writes: "It's not fair to rule out cost, but for many it's not about that." The real value is in sovereignty and predictable expenses.

When to Use Local vs. Cloud

Ellis's workflow: Claude Opus handles complex, multi-step tasks like debugging a VSock file descriptor leak across Slicer VMs. He pastes a problem, and Claude diagnoses it, implements a fix, tests it, and raises a PR, all unattended. Qwen handles smaller, guided tasks where privacy matters or where he wants to avoid API costs.

Examples include generating a landing page component or refactoring a Go function that doesn't cross module boundaries. Qwen can do this reliably if the task is well-scoped and the context window is managed.
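
The split described above can be written down as a small routing rule. This is a sketch of one possible reading, not Ellis's own tooling: the `Task` fields, the `choose_model` function, and the extra "local-supervised" outcome are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical description of a coding task."""
    description: str
    sensitive_code: bool   # must the code stay on your own hardware?
    well_scoped: bool      # single function or component, clear instructions
    unattended: bool       # expected to run without supervision


def choose_model(task: Task) -> str:
    """Return 'local', 'cloud', or 'local-supervised' for a task."""
    if task.sensitive_code:
        # Code cannot leave the machine, so cloud is ruled out. Open-ended
        # or unattended work then needs a human watching the local model.
        if task.well_scoped and not task.unattended:
            return "local"
        return "local-supervised"
    if task.unattended or not task.well_scoped:
        return "cloud"
    return "local"


refactor = Task("refactor one Go function", sensitive_code=False,
                well_scoped=True, unattended=False)
leak = Task("debug a file descriptor leak across VMs", sensitive_code=False,
            well_scoped=False, unattended=True)
print(choose_model(refactor))  # local
print(choose_model(leak))      # cloud
```

The design choice worth noting is that privacy is checked first: when code cannot leave the machine, the question is no longer which model is better but how closely the local one must be supervised.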

Practical Setup

Ellis runs Qwen 3.6 27B on a single RTX 6000 Pro (48GB VRAM) with Q4 quantization and ~200K quantized context. He uses opencode as the agent harness. He warns against quantizing further to fit consumer GPUs, because the infinite loop problem worsens.

# Example: load Qwen 3.6 27B (Q4) with a 200K-token context
# (hypothetical command and model filename, assuming llama.cpp;
#  recent builds name the binary llama-cli, older builds use ./main)
llama-cli -m qwen-3.6-27b-q4_0.gguf -c 204800 --mlock
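
A rough way to see how much room the weights take on the card is to multiply parameter count by bits per weight. The sketch assumes llama.cpp's Q4_0 layout (blocks of 32 four-bit weights plus one 16-bit scale, so 4.5 bits per weight) and covers weights only. The KV cache and runtime overhead depend on the model architecture, which the text does not give, so they are left out.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Q4_0 stores 32 weights as 4-bit values plus one 16-bit scale per block.
Q4_0_BITS = (32 * 4 + 16) / 32   # 4.5 bits per weight
FP16_BITS = 16

print(f"Q4_0: {weight_memory_gb(27, Q4_0_BITS):.1f} GB")
print(f"FP16: {weight_memory_gb(27, FP16_BITS):.1f} GB")
```

Real model files usually keep some tensors at higher precision, so treat these as lower bounds. Whatever VRAM is left after the weights is what the long context has to fit into.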

The Bottom Line

Local models aren't a replacement for SOTA cloud models. They're a complement. Use Claude Opus for unattended, complex, multi-step tasks. Use Qwen for privacy-sensitive, well-scoped, guided tasks where cost predictability matters.

Ellis's advice: Don't benchmark-max. Test your actual workflow. If you need a model to explore a system autonomously, stick with cloud. If you need a model to write a specific function without leaking data, go local.

What You Should Do Now

  • Identify which parts of your workflow require unattended reasoning (use cloud) and which are structured enough for local models.
  • Run a side-by-side: pick one task you do weekly, then try it with Claude Opus and with Qwen 3.6 27B running locally. Measure time, cost, and quality.
  • If you handle sensitive code, invest in a GPU with at least 48GB VRAM. The RTX 6000 Pro paid for itself in 2-3 months for Ellis.
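
For the side-by-side comparison, a few lines of bookkeeping are enough to keep the measurements honest. The sketch below is illustrative only: `time_task` and the two placeholder functions are hypothetical, and you would replace the placeholders with real calls to your own cloud and local setups.

```python
import time
from typing import Callable


def time_task(label: str, run: Callable[[], str], cost_usd: float = 0.0) -> dict:
    """Run one attempt, record wall-clock time, and leave quality to be scored by hand."""
    start = time.perf_counter()
    output = run()
    elapsed = time.perf_counter() - start
    return {"label": label, "seconds": elapsed, "cost_usd": cost_usd,
            "output": output, "quality": None}


# Placeholder callables standing in for real model invocations.
def fake_cloud() -> str:
    return "patch from cloud model"


def fake_local() -> str:
    return "patch from local model"


results = [time_task("cloud", fake_cloud), time_task("local", fake_local)]
for r in results:
    print(f"{r['label']}: {r['seconds']:.3f}s, ${r['cost_usd']:.2f}, quality={r['quality']}")
```

Quality is deliberately left as `None`: it is the one measurement that needs a human reviewing the result, which is the point of testing your actual workflow rather than a benchmark.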