GLM-5.2 on AMD MI355X: 2626 tok/s at 2x lower cost than Blackwell
Wafer published benchmarks showing that GLM-5.2 running on AMD MI355X GPUs achieves 2626 tok/s/node of aggregate throughput at 2.4 requests per second (RPS), with a p50 time-to-first-token (TTFT) of 0.81s and a p95 of 2.22s. That is 80% of the performance measured on a B200, at over 2x lower cost.
The AMD advantage
AMD's MI355X costs about 2.75x less than NVIDIA's B300 per GPU, with comparable hardware specs. But NVIDIA's software advantage (CUDA, day-0 support) typically lets providers serve inference much faster. Wafer had to overcome several ROCm-specific hurdles to close the gap.
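The "over 2x lower cost" headline can be sanity-checked with a short calculation. This is a rough sketch, not Wafer's cost model: it assumes the per-GPU price ratio carries over directly to serving cost, and it combines the 80% throughput figure (quoted against a B200) with the 2.75x price figure (quoted against a B300). The function name is made up for illustration.

```python
# Figures quoted in the article. Treating the GPU price ratio as the
# serving-cost ratio is an assumption made for illustration only.
relative_throughput = 0.80  # MI355X throughput as a fraction of the NVIDIA baseline
price_ratio = 2.75          # NVIDIA GPU price divided by MI355X price


def cost_per_token_advantage(relative_throughput: float, price_ratio: float) -> float:
    """Return how many times cheaper each token is on the lower-priced GPU.

    Cost per token scales as price / throughput, so the advantage is
    (price_ratio * relative_throughput).
    """
    return relative_throughput * price_ratio


advantage = cost_per_token_advantage(relative_throughput, price_ratio)
print(f"Cost per token is roughly {advantage:.2f}x lower")
```

Under these assumptions the ratio comes out a little above 2, which is consistent with the headline claim. Real serving cost also depends on power, hosting, and utilization, none of which are modeled here.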
Quantization: MXFP4 with AMD Quark
The team quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared to z-ai's official FP8 quantization, the MXFP4 version was essentially lossless:
| Eval | FP8 baseline | MXFP4 | Δ |
|---|---|---|---|
| GSM8K (200q, 5-shot, greedy) | 0.965 ± 0.013 | 0.955 ± 0.014 | −0.010 |
| GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
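One way to read "essentially lossless" is to compare each delta against the combined uncertainty of the two scores. The sketch below assumes the ± values in the table are independent standard errors, which the article does not state. The tau2 row is left out because it has no error bars.

```python
import math

# (eval name, FP8 score, FP8 error, MXFP4 score, MXFP4 error), from the table above
rows = [
    ("GSM8K", 0.965, 0.013, 0.955, 0.014),
    ("GPQA-Diamond", 0.9217, 0.027, 0.9026, 0.029),
]


def delta_in_sigmas(base, base_err, new, new_err):
    """Return (delta, combined error, |delta| / combined error).

    Assumes the two errors are independent, so they add in quadrature.
    """
    delta = new - base
    combined = math.hypot(base_err, new_err)
    return delta, combined, abs(delta) / combined


results = {name: delta_in_sigmas(b, be, n, ne) for name, b, be, n, ne in rows}
for name, (delta, combined, z) in results.items():
    print(f"{name}: delta={delta:+.4f}, combined error={combined:.4f}, ratio={z:.2f}")
```

For both evals the ratio is below 1, meaning the drop is smaller than the combined error bar under this assumption.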
Inference engine: sglang over vLLM and ATOM
Wafer evaluated three frameworks: vLLM, ATOM, and sglang. vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang offered native MXFP4 support with the least friction.
Speculative decode fixes
Two bugs blocked speculative decode on the ROCm sglang image:
- MTP head weight prefix mismatch: The MTP head's shared expert weights are stored in bf16 under `model.decoder.*`, but Quark's quantization lookup expected `model.layers.78.mlp.shared_experts.*`. Fix: copy layer 78 entries to the quantization exception list under the decoder prefix.
- Missing ROCm guard in multi-step kernel: The fused multi-step metadata kernel contained an `#include` without a ROCm guard. Fix: add `#ifdef USE_ROCM`.
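The first fix amounts to duplicating exclusion entries under a second prefix. The sketch below illustrates that idea on a plain Python list. The helper name, the example entries, and the exact mapping from `model.layers.78.` to `model.decoder.` are assumptions for illustration; the article does not show Quark's actual config layout or Wafer's patch.

```python
def mirror_exclusions(exclude, src_prefix, dst_prefix):
    """Return a copy of `exclude` in which every entry starting with
    `src_prefix` is also listed with `dst_prefix` substituted for it.

    Hypothetical helper: it only manipulates a list of names.
    """
    mirrored = list(exclude)
    seen = set(exclude)
    for name in exclude:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:  # keep the operation idempotent
                mirrored.append(alias)
                seen.add(alias)
    return mirrored


# Hypothetical exclusion entries, not copied from a real checkpoint.
exclude = [
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.up_proj",
    "lm_head",
]

patched = mirror_exclusions(exclude, "model.layers.78.", "model.decoder.")
for name in patched:
    print(name)
```

Keeping the original entries and appending the aliases means a lookup by either name finds the layer marked as unquantized.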
After these fixes, speculative decode provided a nearly 3x single-stream throughput gain, reaching 213 tok/s on a 10k input / 1.5k output workload.
Prefill optimization: TP4×DP2 and MoE kernel tuning
For aggregate throughput on a 20k input / 1k output workload with 60% cache hit rate, prefill is the bottleneck. Switching from TP8 to TP4×DP2 improved throughput from 1461 to 1944 tok/s/node at 2.0 RPS. But the real gain came from tuning the MoE kernel selection.
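A back-of-the-envelope token count shows why prefill dominates this workload. The sketch assumes cached input tokens cost nothing to prefill and that every request uses exactly the stated input and output lengths. Both are simplifications, and the function name is made up for illustration.

```python
def token_load(input_tokens, output_tokens, cache_hit_rate, rps):
    """Return (uncached prefill tokens/s, decode tokens/s) for a steady request rate.

    Simplification: cache hits are treated as free, so only the uncached
    share of each prompt counts as prefill work.
    """
    uncached_per_request = input_tokens * (1.0 - cache_hit_rate)
    return uncached_per_request * rps, output_tokens * rps


# Workload from the article: 20k input, 1k output, 60% cache hit rate, 2.0 RPS.
prefill_tps, decode_tps = token_load(20_000, 1_000, 0.60, 2.0)
print(f"uncached prefill: {prefill_tps:,.0f} tok/s")
print(f"decode:           {decode_tps:,.0f} tok/s")
print(f"prefill/decode token ratio: {prefill_tps / decode_tps:.1f}")
```

Even with 60% of the prompt served from cache, each request still brings several times more uncached prompt tokens than output tokens. That is consistent with prefill being the bottleneck.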
sglang's ROCm image used a slow FlyDSL heuristic fallback for GLM-5.2's fp4 MoE shapes (model_dim 6144, moe_inter 2048, E=256, topk=8). Wafer tuned the kernel selection manually, achieving 2626 tok/s/node at 2.4 RPS, a 35% improvement over the untuned TP4×DP2 config.
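The reported gains can be chained to see how much each step contributed. This is plain arithmetic on the article's figures. Note that the first two configurations were measured at 2.0 RPS and the last at 2.4 RPS, so the comparison is not at equal offered load. The dictionary keys are labels chosen here.

```python
# Throughput in tok/s/node, as reported in the article.
throughput = {
    "TP8": 1461,
    "TP4xDP2": 1944,
    "TP4xDP2 + tuned MoE": 2626,
}


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return after / before - 1.0


parallelism_gain = relative_gain(throughput["TP8"], throughput["TP4xDP2"])
kernel_gain = relative_gain(throughput["TP4xDP2"], throughput["TP4xDP2 + tuned MoE"])
total_gain = relative_gain(throughput["TP8"], throughput["TP4xDP2 + tuned MoE"])

print(f"TP8 -> TP4xDP2:        {parallelism_gain:.0%}")
print(f"MoE kernel tuning:     {kernel_gain:.0%}")
print(f"TP8 -> tuned TP4xDP2:  {total_gain:.0%}")
```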
Why this matters
This is the first public benchmark showing GLM-5.2 running efficiently on AMD hardware. The team wrote no custom kernels, only configuration changes and minor code fixes. The CUDA moat is eroding: AMD's software stack is maturing to the point where SOTA inference is achievable with modest engineering effort.
Next steps for developers
- Try AMD Quark for MXFP4 quantization on your own models.
- Use sglang with ROCm for inference; contribute fixes for speculative decode.
- Profile your MoE kernel selection; the default fallback may be suboptimal.
- Compare cost-per-token on MI355X vs B200 for your workload.





