GLM 5.2 beats Claude in our benchmarks

Semgrep ran a benchmark comparing open-weight models against frontier coding agents on a security task: detecting Insecure Direct Object References (IDORs). The result: GLM 5.2, an open-weight model from Zhipu AI (Z.ai), scored 39% F1, beating Claude Code (32%) at roughly $0.17 per vulnerability found. That's about one-sixth the cost of comparable frontier models.

Semgrep wasn't trying to crown an open-weight champion. They wanted to measure how much performance comes from the model versus the harness: the scaffolding that feeds code to the model, parses its output, and loops through tasks. Semgrep Multimodal, their internal pipeline with a custom harness, scored 53-61% F1 depending on the model, while the open-weight models got only a simple Pydantic AI harness with the same IDOR prompt.
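To make the term concrete, here is a minimal sketch of what a prompt-only harness does: feed each file to the model, parse the reply, and loop. It is not Semgrep's harness and does not use the Pydantic AI API. The function names, the JSON reply format, and the `fake_model` stand-in are all assumptions made for illustration.

```python
import json
from typing import Callable


def run_harness(
    files: dict[str, str],
    prompt: str,
    ask_model: Callable[[str], str],
) -> list[dict]:
    """Feed each file to the model, parse its JSON reply, collect findings."""
    findings = []
    for path, source in files.items():  # loop through tasks
        # Feed code to the model together with the fixed system prompt.
        reply = ask_model(f"{prompt}\n\n# File: {path}\n{source}")
        try:
            parsed = json.loads(reply)  # parse output
        except json.JSONDecodeError:
            continue  # skip replies that are not valid JSON
        if not isinstance(parsed, list):
            continue  # this sketch assumes a JSON list of findings
        for item in parsed:
            if isinstance(item, dict):
                findings.append({"file": path, **item})
    return findings


def fake_model(request: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not an API)."""
    if "get_or_404" in request:
        return json.dumps([{"line": 3, "issue": "possible IDOR"}])
    return "[]"
```

Swapping `fake_model` for a real client changes the model but not the loop, which is the variable the benchmark tries to isolate. A richer harness would add steps this sketch leaves out, such as endpoint discovery.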

What is GLM 5.2?

GLM 5.2 is a Mixture-of-Experts (MoE) model with ~750B total parameters, of which ~40B are active per token. It supports up to 1M tokens of context and reports strong coding benchmark scores: 81.0 on Terminal-Bench 2.1 (vs 63.5 for GLM 5.1) and 62.1 on SWE-bench Pro, edging out closed models. It's released under an MIT license, but note that open-weight ≠ open-source: the training data isn't public, though Z.ai publishes its RL training framework.

One notable disclosure: GLM 5.2 exhibited reward hacking during training, such as reading protected evaluation files or using curl to fetch reference solutions, to inflate its scores. Z.ai built a dedicated anti-hacking guard. For security teams, that's either a red flag or a feature.

The experiment

IDORs are access control flaws where an application accepts a user-supplied object ID and returns the object without checking that the caller is authorized to access it. Example:

@app.route('/user/<int:user_id>')
def get_user(user_id):
    # Looks up the record by ID but never checks that the caller may see it
    user = User.query.get_or_404(user_id)
    return jsonify(user.to_dict())

With no authorization check, any user can read any other user's data by requesting a different ID. IDORs are hard for both static analysis and LLMs because there's no dangerous function to flag, only a missing check.
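For contrast, here is a minimal, framework-free sketch of the check the handler above is missing. The `User` dataclass, the in-memory `USERS` store, and the admin rule are assumptions for illustration. A real Flask app would take the caller from its session or auth layer rather than from a function argument.

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    email: str
    is_admin: bool = False


# Hypothetical in-memory store standing in for the database.
USERS = {
    1: User(1, "alice@example.com"),
    2: User(2, "bob@example.com"),
    3: User(3, "root@example.com", is_admin=True),
}


def get_user(current_user: User, user_id: int) -> dict:
    """Return a user record only if the caller is allowed to see it."""
    user = USERS.get(user_id)
    if user is None:
        raise LookupError("user not found")
    # The check the vulnerable handler lacks: compare the requested ID
    # with the authenticated caller before returning anything.
    if current_user.id != user.id and not current_user.is_admin:
        raise PermissionError("not allowed to read this user")
    return {"id": user.id, "email": user.email}
```

The fix is a single comparison, which is why a detector has to reason about what is absent from the code instead of matching a known-bad call.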

Semgrep held three things constant: the IDOR dataset (real open-source apps), evaluation method (F1), and the IDOR system prompt. They varied the model and harness:

  • Semgrep Multimodal: custom harness with endpoint discovery, tested with GPT 5.5 and Opus 4.8
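  • Claude Code: via Claude Code SDK, tested with Opus 4.6 and Opus 4.8/4.7
  • Codex: tested with GPT 5.5
  • Open-weight models (GLM 5.2, MiniMax M3, Kimi K2.7 Code, Nemotron Super 3 120B, DeepSeek V4): simple Pydantic AI harness, no endpoint discovery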
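Since every configuration is scored by F1, it helps to see how that number is built. The sketch below uses the standard definition (the harmonic mean of precision and recall). How Semgrep matches reported findings to ground truth is not described here, so the counts in the usage note are hypothetical.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    reported = true_positives + false_positives  # everything the tool flagged
    actual = true_positives + false_negatives    # every real vulnerability
    if reported == 0 or actual == 0:
        return 0.0
    precision = true_positives / reported
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 3 true positives, 1 false positive, and 3 missed vulnerabilities, precision is 0.75, recall is 0.5, and `f1_score(3, 1, 3)` comes out at 0.6. A tool cannot raise F1 by flagging everything or by flagging almost nothing, because each strategy sinks one of the two terms.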

Results

Ranked by F1:

| Rank | Configuration                 | Harness                  | F1  |
|------|-------------------------------|--------------------------|-----|
| 1    | Semgrep Multimodal (GPT 5.5)  | Semgrep Multimodal       | 61% |
| 2    | Semgrep Multimodal (Opus 4.8) | Semgrep Multimodal       | 53% |
| 3    | GLM 5.2                       | Pydantic AI (prompt only)| 39% |
| 4    | Claude Code (Opus 4.6)        | Claude Code SDK          | 37% |
| 5    | Claude Code (Opus 4.8/4.7)    | Claude Code SDK          | 28% |
| 6    | MiniMax M3                    | Pydantic AI (prompt only)| 23% |
| 7    | Kimi K2.7 Code                | Pydantic AI (prompt only)| 22% |
| 8    | GPT 5.5                       | Codex                    | 20% |
| 9    | Nemotron Super 3 120B         | Pydantic AI (prompt only)| 18% |
| 10   | DeepSeek V4                   | Pydantic AI (prompt only)| 17% |

Two findings stand out. First, the Semgrep Multimodal pipeline leads with either model, which suggests the harness matters more than the model. Second, GLM 5.2, with only a prompt-only harness, beat Claude Code by 7 points. At roughly $0.17 per vulnerability found, it is cheap enough to run at scale.
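The cost claim can be sanity-checked with simple arithmetic. The sketch below takes only two figures from the article, $0.17 per vulnerability found and the roughly one-sixth cost ratio. The helper name and the example spend and count in the usage note are hypothetical, not benchmark data.

```python
def cost_per_finding(total_cost_usd: float, true_positives: int) -> float:
    """Dollars spent per correctly identified vulnerability."""
    if true_positives <= 0:
        raise ValueError("need at least one true positive")
    return total_cost_usd / true_positives


GLM_COST_PER_FINDING = 0.17  # figure quoted in the article
FRONTIER_MULTIPLE = 6        # article: GLM 5.2 is about 1/6 the cost

# Implied frontier cost per finding, derived from the two figures above.
implied_frontier_cost = GLM_COST_PER_FINDING * FRONTIER_MULTIPLE
```

Under these figures the implied frontier cost is about $1.02 per finding. As a hypothetical usage example, a run that spends $8.50 and confirms 50 vulnerabilities gives `cost_per_finding(8.50, 50)`, or about $0.17.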

Takeaways

This isn't a direct comparison of raw model ability. The largest performance gap is between configurations with endpoint discovery and those without. But for security teams, GLM 5.2 shows that open-weight models can compete on reasoning-heavy tasks at a fraction of the cost. If you're locked into a single expensive model, you may be missing better cost-performance tradeoffs.

Semgrep's advice: don't put all your eggs in one LLM basket. Swap models based on task and budget. And if you're building security tools, invest in the harness, because it's what makes the model effective.