GLM-5.2 on AMD MI355X: 2626 tok/s at 2x lower cost than Blackwell

Wafer published benchmarks showing that GLM-5.2 on AMD MI355X GPUs reaches 2626 tok/s/node of aggregate throughput at 2.4 requests per second (RPS), with a p50 time-to-first-token (TTFT) of 0.81s and a p95 of 2.22s. That is 80% of the performance measured on a B200, at less than half the cost.

The AMD advantage

AMD's MI355X costs about 2.75x less than NVIDIA's B300 per GPU, with comparable hardware specs. But NVIDIA's software advantage (CUDA, day-0 support) typically lets providers serve inference much faster. Wafer had to overcome several ROCm-specific hurdles to close the gap.
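The "over 2x lower cost" figure can be sanity-checked with a line of arithmetic. The sketch below is only illustrative: the function name is made up here, and it assumes the 2.75x price ratio and the 80% relative performance can be combined, even though the article quotes them against different NVIDIA parts (B300 for price, B200 for performance).

```python
def cost_per_token_advantage(price_ratio: float, throughput_ratio: float) -> float:
    """Return how many times cheaper per token the MI355X would be.

    price_ratio: comparison GPU price / MI355X price (2.75 in the article).
    throughput_ratio: MI355X throughput / comparison GPU throughput (0.80).
    Assumption: both ratios describe the same comparison GPU and workload.
    """
    if price_ratio <= 0 or throughput_ratio <= 0:
        raise ValueError("ratios must be positive")
    # Cost per token scales with price / throughput, so the advantage is the
    # price ratio discounted by the throughput shortfall.
    return price_ratio * throughput_ratio


advantage = cost_per_token_advantage(price_ratio=2.75, throughput_ratio=0.80)
print(f"{advantage:.2f}x lower cost per token")
```

Under those assumptions the result is about 2.2x, in line with the headline claim. Real cost per token also depends on utilization, power, and hosting, which this ignores.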

Quantization: MXFP4 with AMD Quark

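The team quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared with z-ai's official FP8 quantization, the MXFP4 version was essentially lossless: the GSM8K and GPQA-Diamond deltas fall within the reported error margins, and the tau2 macro score is slightly higher.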

Eval                                     | FP8 baseline   | MXFP4          | Δ
GSM8K (200q, 5-shot, greedy)             | 0.965 ± 0.013  | 0.955 ± 0.014  | −0.010
GPQA-Diamond (198q × 2 seeds, temp 1.0)  | 0.9217 ± 0.027 | 0.9026 ± 0.029 | −0.019
tau2 macro                               | 0.819          | 0.834          | +0.015
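One way to read the table is to compare each delta with the combined uncertainty of the two scores. The sketch below assumes the ± values are one standard error and that the FP8 and MXFP4 runs are independent, neither of which the source states; the helper name within_noise is hypothetical.

```python
import math

# (eval, FP8 score, FP8 error, MXFP4 score, MXFP4 error), copied from the table.
# tau2 macro is left out because the table reports no error bar for it.
RESULTS = [
    ("GSM8K", 0.965, 0.013, 0.955, 0.014),
    ("GPQA-Diamond", 0.9217, 0.027, 0.9026, 0.029),
]


def within_noise(base, base_err, quant, quant_err):
    """Return (delta, combined error, whether |delta| <= combined error)."""
    delta = quant - base
    # Assumption: independent errors, so they add in quadrature.
    combined = math.hypot(base_err, quant_err)
    return delta, combined, abs(delta) <= combined


for name, *row in RESULTS:
    delta, combined, ok = within_noise(*row)
    print(f"{name}: delta={delta:+.4f}, combined error={combined:.4f}, within noise={ok}")
```

Both deltas come out smaller than the combined error, which is consistent with calling the quantization essentially lossless on these evals.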

Inference engine: sglang over vLLM and ATOM

Wafer evaluated three frameworks: vLLM, ATOM, and sglang. vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang offered native MXFP4 support with the least friction.

Speculative decode fixes

Two bugs blocked speculative decode on the ROCm sglang image:

  1. MTP head weight prefix mismatch: The MTP head's shared expert weights are stored in bf16 under model.decoder.*, but Quark's quantization lookup expected model.layers.78.mlp.shared_experts.*. Fix: copy layer 78 entries to the quantization exception list under the decoder prefix.

  2. Missing ROCm guard in multi-step kernel: The fused multi-step metadata kernel included #include without a ROCm guard. Fix: add #ifdef USE_ROCM.
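The first fix amounts to duplicating exclusion entries under a second prefix. A minimal sketch of that idea follows. It assumes the exclusion list is a flat list of module-name strings; the prefixes "model.layers.78." and "model.decoder." and the sub-module names in the example are illustrative guesses, not names confirmed from the actual checkpoint or from Quark.

```python
def mirror_exclusions(exclude, src_prefix, dst_prefix):
    """Return `exclude` plus a copy of each `src_prefix` entry under `dst_prefix`.

    Existing entries keep their order, and aliases already present are not
    added twice, so the function is safe to run more than once.
    """
    result = list(exclude)
    seen = set(result)
    for name in exclude:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:
                result.append(alias)
                seen.add(alias)
    return result


# Hypothetical exclusion list; real entries depend on the exported checkpoint.
exclude = [
    "lm_head",
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
]
for entry in mirror_exclusions(exclude, "model.layers.78.", "model.decoder."):
    print(entry)
```

Keeping the original entries alongside the aliases means a loader that uses either naming scheme still finds the bf16 weights marked as unquantized.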

After these fixes, speculative decode provided a nearly 3x single-stream throughput gain, reaching 213 tok/s on a 10k input / 1.5k output workload.

Prefill optimization: TP4×DP2 and MoE kernel tuning

For aggregate throughput on a 20k input / 1k output workload with a 60% cache hit rate, prefill is the bottleneck. Switching from TP8 to TP4×DP2 improved throughput from 1461 to 1944 tok/s/node at 2.0 RPS, a gain of about 33%. A further gain came from tuning the MoE kernel selection.
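To get a feel for the workload, the token volumes per request can be worked out directly. The sketch below assumes the 60% cache hit rate means 60% of input tokens are served from the prefix cache, which the source does not spell out; the function names are illustrative.

```python
def per_request_tokens(input_tokens: int, output_tokens: int, cache_hit_rate: float):
    """Return (uncached prefill tokens, decode tokens) for one request."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    # Assumption: cached input tokens skip prefill entirely.
    uncached = round(input_tokens * (1.0 - cache_hit_rate))
    return uncached, output_tokens


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return after / before - 1.0


prefill, decode = per_request_tokens(20_000, 1_000, 0.60)
print(f"per request: {prefill} uncached prefill tokens, {decode} decode tokens")
print(f"TP8 -> TP4xDP2 gain: {relative_gain(1461, 1944):.0%}")
```

Even after caching, each request still carries several times more prefill tokens than decode tokens. Token counts alone do not prove where the bottleneck sits, but they show why prefill cost is worth examining on this workload.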

The sglang ROCm image used a slow FlyDSL heuristic fallback for GLM-5.2's fp4 MoE shapes (model_dim 6144, moe_inter 2048, E=256, topk=8). Wafer tuned the kernel selection manually, achieving 2626 tok/s/node at 2.4 RPS, a 35% improvement over the untuned TP4×DP2 config.
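Manual kernel selection generally comes down to timing each candidate on the shape in question and recording the winner. The sketch below shows that loop in generic form. It is not sglang's or FlyDSL's tuning interface: the candidate names and the CPU stand-in workloads are placeholders, and a real GPU benchmark would also need warm-up runs and device synchronization around the timed region.

```python
import time
from typing import Callable, Dict

# Shape from the article; the key names are illustrative.
GLM_MOE_SHAPE = {"model_dim": 6144, "moe_inter": 2048, "num_experts": 256, "topk": 8}


def time_once(fn: Callable[[], object], repeats: int = 5) -> float:
    """Best-of-N wall-clock time for fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def pick_fastest(candidates: Dict[str, Callable[[], object]], measure=time_once) -> str:
    """Return the name of the candidate with the lowest measured time."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    timings = {name: measure(fn) for name, fn in candidates.items()}
    return min(timings, key=timings.get)


# Placeholder candidates standing in for real kernel launches.
candidates = {
    "heuristic_fallback": lambda: sum(i * i for i in range(200_000)),
    "tuned_config": lambda: sum(i * i for i in range(20_000)),
}
print("selected:", pick_fastest(candidates), "for shape", GLM_MOE_SHAPE)
```

Passing the measurement function as a parameter keeps the selection logic separate from how timing is done, so the same loop can be reused with a GPU-aware timer.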

Why this matters

This is the first public benchmark showing GLM-5.2 running efficiently on AMD hardware. The team wrote no custom kernels, only configuration changes and minor code fixes. The CUDA moat is eroding: AMD's software stack is maturing to the point where state-of-the-art inference is achievable with modest engineering effort.

Next steps for developers

  • Try AMD Quark for MXFP4 quantization on your own models.
  • Use sglang with ROCm for inference; contribute fixes for speculative decode.
  • Profile your MoE kernel selection; the default fallback may be suboptimal.
  • Compare cost-per-token on MI355X vs B200 for your workload.