AMD MI355X Beats B200 on Performance Per Dollar for GLM-5.2 Inference

Wafer reports 2626 tok/s/node on AMD MI355X with GLM-5.2 at 2.4 RPS, roughly 80% of B200 throughput at under half the cost. The gains came from MXFP4 quantization via AMD Quark, sglang with MTP speculative decode, and manual MoE kernel tuning. The result suggests the CUDA moat is eroding as AMD's software stack matures.

3 min read · Jul 4, 2026

AMD MI355X Delivers 80% of B200 Performance at <50% Cost for GLM-5.2

Wafer published benchmarks showing its MI355X cluster hitting 2626 tok/s/node at 2.4 requests per second (RPS) on a 20k-input / 1k-output workload with a 60% cache hit rate. That is roughly 80% of the B200's 3192 tok/s at 3.0 RPS (2626 / 3192 ≈ 0.82), on GPUs that cost 2.75x less each. Single-stream GLM-5.2 decode reached 213 tok/s on a 10k-input / 1.5k-output workload.
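The performance-per-dollar claim follows from two of the quoted numbers. The sketch below works through the arithmetic, assuming that both nodes hold the same number of GPUs and that per-GPU price is the only cost that differs (power, networking, and hosting are ignored). The variable names are illustrative and not taken from Wafer's write-up.

```python
# Figures quoted in the article.
mi355x_tok_s = 2626.0  # MI355X throughput, tok/s/node at 2.4 RPS
b200_tok_s = 3192.0    # B200 throughput, tok/s at 3.0 RPS
price_ratio = 2.75     # B200 price / MI355X price per GPU, as quoted

# Assumption: same GPU count per node, so the per-GPU price ratio
# carries over to the node level unchanged.
throughput_ratio = mi355x_tok_s / b200_tok_s
perf_per_dollar_ratio = throughput_ratio * price_ratio

print(f"MI355X throughput relative to B200: {throughput_ratio:.3f}")
print(f"MI355X perf/$ relative to B200:     {perf_per_dollar_ratio:.2f}")
```

Under those assumptions the MI355X delivers a little over twice the throughput per dollar, which is consistent with the "under half the cost" framing.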

MXFP4 Quantization with AMD Quark

Wafer quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared with z-ai's official FP8 quantization, MXFP4 scored within the reported error bars on GSM8K and GPQA-Diamond and slightly higher on tau2, so it is effectively lossless on these evals. The eval table:

Eval	FP8 baseline	MXFP4	Δ (MXFP4 − FP8)
GSM8K (200q, 5-shot, greedy)	0.965 ± 0.013	0.955 ± 0.014	-0.010
GPQA-Diamond (198q × 2 seeds, temp 1.0)	0.9217 ± 0.027	0.9026 ± 0.029	-0.019
tau2 macro	0.819	0.834	+0.015
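One way to read "lossless" from the table is to compare each delta with the combined uncertainty of the two runs. The sketch below does that, assuming the ± values are one standard error and that the FP8 and MXFP4 runs are independent. Neither assumption is stated in the source. tau2 is left out because no error bar is reported for it.

```python
import math

# (fp8_score, fp8_err, mxfp4_score, mxfp4_err), copied from the table above.
evals = {
    "GSM8K": (0.965, 0.013, 0.955, 0.014),
    "GPQA-Diamond": (0.9217, 0.027, 0.9026, 0.029),
}


def within_noise(fp8, fp8_err, mx, mx_err):
    """Return (delta, combined_err, ok) for one eval row.

    Assumes independent runs, so the errors add in quadrature.
    """
    delta = mx - fp8
    combined_err = math.hypot(fp8_err, mx_err)
    return delta, combined_err, abs(delta) <= combined_err


results = {name: within_noise(*row) for name, row in evals.items()}

for name, (delta, combined_err, ok) in results.items():
    verdict = "within noise" if ok else "outside noise"
    print(f"{name}: delta={delta:+.4f}, combined err={combined_err:.4f} -> {verdict}")
```

Both deltas fall inside the combined error, so under these assumptions the table does not show a measurable regression.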

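MXFP4 weights take roughly half the bytes of FP8 weights, which halves the weight traffic per decoded token. That matters because decode is memory-bandwidth bound, even with the MI355X's 8 TB/s of HBM3E bandwidth.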
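The "half the bandwidth" figure can be checked from the MXFP4 layout in the OCP Microscaling format: blocks of 32 four-bit elements that share one 8-bit scale. The sketch below computes the effective bits per weight. It ignores any scaling factors stored alongside the FP8 checkpoint, which is a simplifying assumption.

```python
# MXFP4 layout per the OCP Microscaling (MX) format.
BLOCK_SIZE = 32   # elements sharing one scale
ELEMENT_BITS = 4  # E2M1 element
SCALE_BITS = 8    # E8M0 shared scale


def mxfp4_bits_per_weight(block_size=BLOCK_SIZE,
                          element_bits=ELEMENT_BITS,
                          scale_bits=SCALE_BITS):
    """Effective storage per weight, amortizing the shared scale."""
    return element_bits + scale_bits / block_size


mxfp4_bits = mxfp4_bits_per_weight()
ratio_vs_fp8 = mxfp4_bits / 8    # assumption: FP8 scale overhead ignored
ratio_vs_bf16 = mxfp4_bits / 16

print(f"MXFP4: {mxfp4_bits} bits/weight")
print(f"vs FP8:  {ratio_vs_fp8:.3f}x the bytes")
print(f"vs bf16: {ratio_vs_bf16:.3f}x the bytes")
```

The shared scale adds a quarter of a bit per weight, so the saving over FP8 is slightly less than an exact halving.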

Inference Framework: sglang Over vLLM and ATOM

vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang had native MXFP4 support and the least friction. Two patches were needed to enable MTP speculative decode:

• MTP head quantization mismatch: The MTP head's shared expert is stored in bf16 but registered under a different module prefix (model.decoder.* vs model.layers.78.mlp.shared_experts.*). Quark records un-quantized layers by name, so Wafer copied the layer 78 entries under the decoder prefix that sglang uses. This fixed the shape-mismatch crash.
• CUDA-ism in multi-step kernel: The fused multi-step metadata kernel for draft depth ≥4 has a CUDA-specific #include without a ROCm guard. One #ifdef USE_ROCM guard fixed it.
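The first patch amounts to duplicating entries in a name list under a second prefix. The sketch below illustrates that idea on a plain Python list. It is not Wafer's patch: the function name, the example layer names, and the full destination prefix are hypothetical, since the source gives only the two prefix patterns and not the exact names or the config field that holds them.

```python
def mirror_excluded_layers(excluded, src_prefix, dst_prefix):
    """Return `excluded` plus a copy of each `src_prefix` entry under `dst_prefix`.

    Entries already present are not added twice, so the call is idempotent.
    """
    seen = set(excluded)
    mirrored = [
        dst_prefix + name[len(src_prefix):]
        for name in excluded
        if name.startswith(src_prefix)
    ]
    return list(excluded) + [name for name in mirrored if name not in seen]


# Hypothetical example names; the real checkpoint's names may differ.
excluded = [
    "lm_head",
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
]

patched = mirror_excluded_layers(
    excluded,
    src_prefix="model.layers.78.mlp.shared_experts.",
    dst_prefix="model.decoder.mlp.shared_experts.",  # hypothetical target prefix
)

for name in patched:
    print(name)
```

Keeping the original entries and adding the mirrored ones means either naming scheme resolves to "not quantized", which is the behavior the fix needs.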

With speculative decoding working, Wafer enabled --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion to reach 213 tok/s single-stream.

Prefill Optimization: TP4×DP2 and MoE Kernel Tuning

At TP8, the MI355X reached 1461 tok/s/node in prefill. Switching to TP4×DP2 (two data-parallel replicas, each tensor-parallel across four GPUs) raised that to 1944 tok/s/node at 2.0 RPS. sglang's MoE kernel, however, was still falling back to a slow FlyDSL heuristic for fp4 shapes. Wafer manually tuned MoE kernel selection for GLM-5.2's fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), reaching 2626 tok/s/node at 2.4 RPS.
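To see how much each step contributed, the quoted throughputs can be turned into step-by-step speedups. This is a rough sketch: the request rates differ between stages (2.0 vs 2.4 RPS, and none is given for TP8), so the ratios are indicative only. The stage labels are mine.

```python
# Throughputs quoted in the article, in tok/s/node.
stages = [
    ("TP8 baseline", 1461.0),
    ("TP4xDP2", 1944.0),
    ("TP4xDP2 + tuned MoE kernel selection", 2626.0),
]


def step_speedups(stages):
    """Speedup of each stage over the one before it."""
    return [
        (name, current / previous)
        for (_, previous), (name, current) in zip(stages, stages[1:])
    ]


speedups = step_speedups(stages)
total_speedup = stages[-1][1] / stages[0][1]

# TP4 x DP2 still occupies all eight GPUs of the node.
gpus_used = 4 * 2

for name, factor in speedups:
    print(f"{name}: {factor:.2f}x over previous stage")
print(f"total: {total_speedup:.2f}x on {gpus_used} GPUs")
```

The parallelism change and the kernel tuning each contribute roughly a third more throughput, and together they account for about 1.8x over the TP8 starting point.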

Why This Matters

Wafer's results show that with the right quantization and framework tuning, AMD's MI355X can deliver competitive inference performance without custom CUDA kernels. As AMD's ROCm stack and tools like Quark mature, the CUDA moat narrows. For developers, this means cost-effective alternatives to NVIDIA exist today for production inference workloads.

Editor's Take

I've been watching AMD's ROCm progress for years, and this is the first time I've seen a real-world benchmark that makes me consider switching from NVIDIA. The fact that Wafer didn't write any custom kernels, only framework patches and kernel tuning, is a big deal. If you're running inference at scale, it's worth testing AMD hardware now. The software gap is closing fast.

— DevDigest Editorial

Key Takeaways

• Use AMD Quark for MXFP4 quantization: it scores within noise of FP8 on key benchmarks and roughly halves memory bandwidth requirements.
• Choose sglang over vLLM for AMD: sglang has native MXFP4 support and required only two small patches for MTP speculative decode.
• For prefill-heavy workloads on AMD, use TP4×DP2 instead of TP8, and tune MoE kernel selection manually for your model's shapes.

Why It Matters

NVIDIA GPU prices are climbing as demand outstrips supply. By Wafer's numbers, the MI355X costs 2.75x less per GPU and reaches roughly 80% of B200 throughput after tuning. With AMD Quark and sglang, developers can get competitive inference performance per dollar without writing custom CUDA kernels.

#amd #rocm #inference #SGLang #GLM-5.2
