GLM5.2 on AMD MI355X: 2626 tok/s at 2x lower cost than Blackwell

Wafer's benchmarks show GLM5.2 running on AMD MI355X achieves 2626 tok/s/node at 2.4 RPS with under 5s TTFT, delivering over 2x cost savings versus NVIDIA B200. Key optimizations include MXFP4 quantization with AMD Quark, speculative decode fixes for sglang on ROCm, and custom MoE kernel tuning.

3 min read · Jul 4, 2026

Wafer published benchmarks showing GLM5.2 running on AMD MI355X GPUs achieves 2626 tok/s/node aggregate throughput at 2.4 requests per second (RPS) with a p50 time-to-first-token (TTFT) of 0.81s and p95 of 2.22s. That's 80% of the performance measured on a B200, but at over 2x lower cost.

The AMD advantage

AMD's MI355X costs about 2.75x less than NVIDIA's B300 per GPU, with comparable hardware specs. But NVIDIA's software advantage (CUDA, day-0 support) typically lets providers serve inference much faster. Wafer had to overcome several ROCm-specific hurdles to close the gap.
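As a rough sanity check, the price and throughput figures can be combined into a cost-per-token ratio. The sketch below assumes the quoted per-GPU price ratio (2.75x) applies to the same GPU used in the throughput comparison and that cost scales linearly with GPU price; the benchmark does not state how its "over 2x" figure was derived, and the function name is illustrative.

```python
def cost_per_token_advantage(gpu_price_ratio: float, relative_throughput: float) -> float:
    """Return how many times cheaper a token is on the lower-priced GPU.

    gpu_price_ratio: reference GPU price divided by the cheaper GPU price.
    relative_throughput: cheaper GPU throughput divided by reference throughput.
    """
    if gpu_price_ratio <= 0 or relative_throughput <= 0:
        raise ValueError("both ratios must be positive")
    # Cost per token is price / throughput, so the advantage is the
    # price ratio scaled down by the throughput shortfall.
    return gpu_price_ratio * relative_throughput


# Figures quoted in the article: 2.75x cheaper per GPU, 80% of the throughput.
advantage = cost_per_token_advantage(gpu_price_ratio=2.75, relative_throughput=0.80)
print(f"{advantage:.2f}x lower cost per token")
```

Under these assumptions the result is about 2.2x, which is consistent with the "over 2x" claim.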

Quantization: MXFP4 with AMD Quark

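The team quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared with z-ai's official FP8 quantization, the MXFP4 version was essentially lossless. The GSM8K and GPQA-Diamond differences fall within the reported error margins, and the tau2 macro score rose slightly: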

Eval	FP8 baseline	MXFP4	Δ
GSM8K (200q, 5-shot, greedy)	0.965 ± 0.013	0.955 ± 0.014	−0.010
GPQA-Diamond (198q × 2 seeds, temp 1.0)	0.9217 ± 0.027	0.9026 ± 0.029	−0.019
tau2 macro	0.819	0.834	+0.015
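One way to read "essentially lossless" is to compare each score difference with the combined uncertainty of the two measurements. The sketch below assumes the ± values are independent standard errors, which the table does not confirm; tau2 is left out because no error margin is reported for it.

```python
import math

# (eval, FP8 score, FP8 margin, MXFP4 score, MXFP4 margin), copied from the table.
RESULTS = [
    ("GSM8K", 0.965, 0.013, 0.955, 0.014),
    ("GPQA-Diamond", 0.9217, 0.027, 0.9026, 0.029),
]


def within_noise(base, base_err, quant, quant_err):
    """Return (delta, combined margin, whether |delta| fits inside the margin)."""
    delta = quant - base
    # Independent errors add in quadrature (an assumption about the ± values).
    combined = math.hypot(base_err, quant_err)
    return delta, combined, abs(delta) <= combined


for name, base, base_err, quant, quant_err in RESULTS:
    delta, combined, ok = within_noise(base, base_err, quant, quant_err)
    verdict = "within noise" if ok else "outside noise"
    print(f"{name}: delta={delta:+.4f}, combined margin={combined:.4f} -> {verdict}")
```

Both evals land inside the combined margin under this reading, so the small drops cannot be distinguished from run-to-run variation.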

Inference engine: sglang over vLLM and ATOM

Wafer evaluated three frameworks: vLLM, ATOM, and sglang. vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang offered native MXFP4 support with the least friction.

Speculative decode fixes

Two bugs blocked speculative decode on the ROCm sglang image:

MTP head weight prefix mismatch: The MTP head's shared expert weights are stored in bf16 under model.decoder.*, but Quark's quantization lookup expected model.layers.78.mlp.shared_experts.*. Fix: copy layer 78 entries to the quantization exception list under the decoder prefix.
Missing ROCm guard in multi-step kernel: The fused multi-step metadata kernel contained an #include directive that was not wrapped in a ROCm guard. Fix: add #ifdef USE_ROCM.
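The first fix amounts to mirroring entries in an exclusion list under a second prefix. The sketch below is only an illustration of that idea: the list format, the helper name, and the example weight names are hypothetical, and the exact mapping between the layer-78 names and the decoder names is not given in the write-up. It is not AMD Quark's API.

```python
SOURCE_PREFIX = "model.layers.78."   # prefix the quantization lookup expected
TARGET_PREFIX = "model.decoder."     # prefix the MTP head weights are stored under


def mirror_excluded_layers(exclude, source_prefix=SOURCE_PREFIX,
                           target_prefix=TARGET_PREFIX):
    """Return the exclusion list plus copies of matching entries under a new prefix.

    Entries are kept in their original order, and duplicates are not added.
    """
    result = list(exclude)
    seen = set(result)
    for name in exclude:
        if name.startswith(source_prefix):
            mirrored = target_prefix + name[len(source_prefix):]
            if mirrored not in seen:
                result.append(mirrored)
                seen.add(mirrored)
    return result


# Hypothetical exclusion list; real entry names depend on the checkpoint.
exclude = [
    "lm_head",
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
]
for entry in mirror_excluded_layers(exclude):
    print(entry)
```

Keeping the original entries and appending the mirrored ones means the change is additive, so it should not alter how any other layer is handled.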

After these fixes, speculative decode delivered a nearly 3x gain in single-stream throughput, reaching 213 tok/s on a 10k input / 1.5k output workload.

Prefill optimization: TP4×DP2 and MoE kernel tuning

For aggregate throughput on a 20k input / 1k output workload with 60% cache hit rate, prefill is the bottleneck. Switching from TP8 to TP4×DP2 improved throughput from 1461 to 1944 tok/s/node at 2.0 RPS, a gain of about 33%. Tuning the MoE kernel selection then added a comparable gain on top.
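To make the prefill pressure concrete, here is a rough sketch. It assumes cached input tokens cost nothing to prefill and that TP4×DP2 means two independent replicas of four contiguous GPU ranks each; neither detail is stated in the benchmark, and the helper names are illustrative.

```python
def uncached_prefill_tokens_per_s(rps, input_tokens, cache_hit_rate):
    """Input tokens per second that still need prefill after cache hits."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return rps * input_tokens * (1.0 - cache_hit_rate)


def rank_groups(num_gpus, tp_size):
    """Split GPU ranks into data-parallel replicas of tp_size contiguous ranks."""
    if tp_size <= 0 or num_gpus % tp_size != 0:
        raise ValueError("num_gpus must be a positive multiple of tp_size")
    return [list(range(start, start + tp_size))
            for start in range(0, num_gpus, tp_size)]


def relative_gain(before, after):
    """Fractional improvement of after over before."""
    return after / before - 1.0


# Workload from the article: 20k input tokens, 60% cache hit rate, 2.0 RPS.
load = uncached_prefill_tokens_per_s(rps=2.0, input_tokens=20_000, cache_hit_rate=0.60)
print(f"uncached prefill load: {load:.0f} tok/s")

print("TP8:    ", rank_groups(num_gpus=8, tp_size=8))
print("TP4xDP2:", rank_groups(num_gpus=8, tp_size=4))

print(f"TP8 -> TP4xDP2 gain: {relative_gain(1461, 1944):.0%}")
```

With two replicas, each request is prefilled by four GPUs instead of eight, but two requests can be prefilled at the same time, which is the trade-off the reported numbers favor for this workload.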

The sglang ROCm image used a slow FlyDSL heuristic fallback for GLM-5.2's fp4 MoE shapes (model_dim 6144, moe_inter 2048, E=256, topk=8). Wafer tuned the kernel selection manually, achieving 2626 tok/s/node at 2.4 RPS, a 35% improvement over the untuned TP4×DP2 config.
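The general pattern behind this kind of tuning is a shape-keyed lookup that is consulted before any heuristic. The sketch below illustrates that pattern only: the table, the kernel labels, and the function names are placeholders, not sglang's or FlyDSL's actual interfaces, and the article does not say which kernels Wafer selected.

```python
from typing import NamedTuple


class MoeShape(NamedTuple):
    model_dim: int
    moe_inter: int
    num_experts: int
    topk: int


# Shape quoted in the article for GLM-5.2's fp4 MoE layers.
GLM_MOE_SHAPE = MoeShape(model_dim=6144, moe_inter=2048, num_experts=256, topk=8)

# Hypothetical tuned table; the value is a placeholder label, not a real kernel name.
TUNED_KERNELS = {GLM_MOE_SHAPE: "tuned_kernel_placeholder"}


def select_kernel(shape, tuned=TUNED_KERNELS):
    """Prefer a tuned entry for this exact shape, else fall back to a heuristic."""
    choice = tuned.get(shape)
    if choice is not None:
        return choice, "tuned"
    return "heuristic_fallback", "fallback"


def relative_gain(before, after):
    """Fractional improvement of after over before."""
    return after / before - 1.0


print(select_kernel(GLM_MOE_SHAPE))
print(select_kernel(MoeShape(4096, 1024, 64, 4)))
print(f"tuned vs untuned TP4xDP2: {relative_gain(1944, 2626):.0%}")
```

Logging which branch was taken for each shape at startup is a cheap way to spot a model that is silently running on the fallback path.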

Why this matters

This is the first public benchmark showing GLM5.2 running efficiently on AMD hardware. The team wrote no custom kernels; the gains came from configuration and minor code fixes. The CUDA moat is eroding: AMD's software stack is maturing to the point where SOTA inference is achievable with modest engineering effort.

Next steps for developers

Try AMD Quark for MXFP4 quantization on your own models.
Use sglang with ROCm for inference; contribute fixes for speculative decode.
Profile your MoE kernel selection; the default fallback may be suboptimal.
Compare cost-per-token on MI355X vs B200 for your workload.

Editor's Take

I've been skeptical of AMD for inference ever since struggling with ROCm compatibility a year ago. But these results are hard to ignore. The fact that Wafer achieved this without writing custom kernels, just by fixing a prefix mismatch and adding an #ifdef, shows how far the ecosystem has come. I'm still cautious about multi-node scaling, but for single-node deployments, AMD is now a serious contender.

— DevDigest Editorial

Key Takeaways

•Use AMD Quark for MXFP4 quantization to reduce memory and cost; in these evals, accuracy stayed within noise of the FP8 baseline.
•When deploying on ROCm, expect to patch speculative decode; here it took a weight-prefix fix in the quantization exception list and a missing ROCm guard in a kernel.
•Profile MoE kernel selection on your specific model shapes; the default fallback may be suboptimal.

Why It Matters

NVIDIA GPU shortages and rising costs make AMD a viable alternative for inference. This benchmark suggests that, with proper optimization, AMD's MI355X can deliver about 80% of B200 throughput at less than half the cost, which could lower token prices for developers deploying large models.

#amd #rocm #inference #SGLang #moE
