AMD MI355X Delivers 80% of B200 Performance at <50% Cost for GLM-5.2
Wafer published benchmarks showing its MI355X cluster reaching 2626 tok/s/node at 2.4 requests per second (RPS) on a 20k-input / 1k-output workload with a 60% cache hit rate. That is roughly 80% of the B200's 3192 tok/s at 3.0 RPS (2626 / 3192 ≈ 0.82), on GPUs that cost 2.75x less each. Single-stream GLM-5.2 decode reached 213 tok/s on a 10k-in / 1.5k-out workload.
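The cost argument can be checked with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted above; it assumes the B200 number is also per node and that both nodes hold the same number of GPUs, so the per-GPU price ratio carries over to the node level.

```python
# Figures quoted in the benchmark summary above.
mi355x_tok_s = 2626   # tok/s/node at 2.4 RPS
b200_tok_s = 3192     # tok/s at 3.0 RPS (assumed to be per node as well)
price_ratio = 2.75    # B200 price / MI355X price, per GPU

relative_throughput = mi355x_tok_s / b200_tok_s
relative_cost = 1 / price_ratio
# Throughput per dollar relative to the B200 (assumes equal GPU counts per node).
throughput_per_dollar = relative_throughput * price_ratio

print(f"relative throughput: {relative_throughput:.1%}")
print(f"relative cost: {relative_cost:.1%}")
print(f"throughput per dollar vs B200: {throughput_per_dollar:.2f}x")
```

Under those assumptions, the MI355X delivers a bit more than twice the throughput per hardware dollar. Power, networking, and hosting costs are not included.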
MXFP4 Quantization with AMD Quark
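Wafer quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared with z-ai's official FP8 quantization, MXFP4 showed no measurable loss on GPQA-Diamond, tau2, or GSM8K: the GSM8K and GPQA-Diamond deltas fall inside the reported error bars, and tau2 improved slightly. The eval table: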
| Eval | FP8 baseline | MXFP4 | Δ |
|---|---|---|---|
| GSM8K (200q, 5-shot, greedy) | 0.965 ± 0.013 | 0.955 ± 0.014 | -0.010 |
| GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | -0.019 |
| tau2 macro | 0.819 | 0.834 | +0.015 |
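
One way to read the table is to compare each delta with the combined uncertainty of the two runs. The sketch below assumes the ± values are one standard error and that the FP8 and MXFP4 runs are independent, neither of which the source states. tau2 is omitted because no error bars are reported for it.

```python
import math

# (FP8 mean, FP8 err, MXFP4 mean, MXFP4 err), copied from the table above.
evals = {
    "GSM8K": (0.965, 0.013, 0.955, 0.014),
    "GPQA-Diamond": (0.9217, 0.027, 0.9026, 0.029),
}

def within_noise(base, base_err, quant, quant_err):
    """Return (ok, delta, combined), where ok is True if |delta| does not
    exceed the two error bars added in quadrature."""
    delta = quant - base
    combined = math.hypot(base_err, quant_err)
    return abs(delta) <= combined, delta, combined

for name, row in evals.items():
    ok, delta, combined = within_noise(*row)
    print(f"{name}: delta={delta:+.4f}, combined err={combined:.4f}, within noise={ok}")
```

Both deltas come out smaller than the combined error, which is consistent with the claim that MXFP4 loses nothing measurable at these sample sizes.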
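MXFP4 weights take roughly half the bytes of FP8 weights, so each decode step reads about half as much weight data from memory. That matters because decode is typically memory-bandwidth-bound, even with the MI355X's 8 TB/s of HBM3E bandwidth.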
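To make the format concrete, here is a minimal pure-Python sketch of MXFP4 block quantization in the style of the OCP Microscaling format: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. It illustrates the number format only and is not Quark's implementation; the rounding rule (nearest grid point, ties toward the smaller magnitude) is a simplification.

```python
import math

FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes
BLOCK_SIZE = 32  # elements that share one 8-bit scale
ELEM_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)

def quantize_block(values):
    """Quantize one block; return (shared_exponent, dequantized_values)."""
    assert len(values) == BLOCK_SIZE
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0.0] * BLOCK_SIZE
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    shared_exp = (math.frexp(max_abs)[1] - 1) - ELEM_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for v in values:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6
        q = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, v))
    return shared_exp, out

# Storage cost: 32 four-bit elements plus one 8-bit shared scale.
bits_per_weight = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
print(bits_per_weight, bits_per_weight / 8, bits_per_weight / 16)
```

At 4.25 bits per weight, MXFP4 needs about 53% of the bytes of FP8 and about 27% of bf16, which is where the "roughly half" figure comes from.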
Inference Framework: sglang Over vLLM and ATOM
vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang had native MXFP4 support and the least friction. Two patches were needed to enable MTP speculative decode:
- MTP head quantization mismatch: The MTP head's shared expert is stored in bf16 but registered under a different module prefix (`model.decoder.*` vs `model.layers.78.mlp.shared_experts.*`). Quark records un-quantized layers by name; Wafer copied the layer 78 entries under the decoder prefix sglang uses. This fixed the shape mismatch crash.
- CUDA-ism in multi-step kernel: The fused multi-step metadata kernel for draft depth ≥4 has an `#include` of a CUDA-only header without a ROCm guard. A single `#ifdef USE_ROCM` guard fixed it.
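The first patch amounts to duplicating name-based entries under a second prefix. A minimal sketch of that idea follows, assuming the un-quantized layers are kept as a flat list of module names. The list contents, the leaf names (`gate_proj`, `up_proj`), and the exact prefix mapping are hypothetical illustrations, not Quark's or sglang's actual schema.

```python
def alias_excluded_layers(exclude, src_prefix, dst_prefix):
    """Return `exclude` plus a copy of every entry that starts with
    src_prefix, rewritten to start with dst_prefix instead."""
    aliases = [
        dst_prefix + name[len(src_prefix):]
        for name in exclude
        if name.startswith(src_prefix)
    ]
    # dict.fromkeys preserves order and drops duplicates.
    return list(dict.fromkeys(exclude + aliases))

# Hypothetical entries, for illustration only.
exclude = [
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.up_proj",
    "lm_head",
]
patched = alias_excluded_layers(
    exclude,
    src_prefix="model.layers.78.",  # name recorded at quantization time
    dst_prefix="model.decoder.",    # assumed name the loader looks up
)
print(patched)
```

Keeping the original entries alongside the aliases means that a loader using either naming scheme still finds the layer marked as un-quantized.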
With speculative decoding working, Wafer enabled `--kv-cache-dtype fp8_e4m3` and `--enable-aiter-allreduce-fusion` to reach 213 tok/s single-stream.
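For reference, the two flags named above could be assembled into an sglang launch command as in the sketch below. The model path is a hypothetical placeholder. The speculative-decoding and parallelism options are left out because the source does not give them, so this is not the full command Wafer ran.

```python
import shlex

MODEL_PATH = "/models/glm-5.2-mxfp4"  # hypothetical local checkpoint path

argv = [
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL_PATH,
    "--kv-cache-dtype", "fp8_e4m3",     # flag as quoted in the text
    "--enable-aiter-allreduce-fusion",  # flag as quoted in the text
]

# Build a shell-safe string for logging or copy-paste; nothing is executed here.
command = shlex.join(argv)
print(command)
```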
Prefill Optimization: TP4×DP2 and Custom MoE Kernels
At TP8, the MI355X reached 1461 tok/s/node in prefill. Switching to TP4×DP2 raised that to 1944 tok/s/node at 2.0 RPS. However, sglang's MoE kernel selection was falling back to a slow FlyDSL heuristic for fp4 shapes. Wafer hand-tuned the kernel selection for GLM-5.2's fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), reaching 2626 tok/s/node at 2.4 RPS.
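Manual kernel selection of this kind is usually a shape-keyed override table consulted before the generic heuristic. The sketch below shows that pattern using the GLM-5.2 shape quoted above. The kernel names, the `MoEShape` type, and the function names are hypothetical placeholders and do not reflect sglang's actual tuning mechanism.

```python
from typing import Callable, NamedTuple

class MoEShape(NamedTuple):
    model_dim: int
    moe_inter: int
    num_experts: int
    topk: int

# Hand-tuned overrides keyed by shape; the kernel name is a made-up placeholder.
TUNED_KERNELS = {
    MoEShape(model_dim=6144, moe_inter=2048, num_experts=256, topk=8): "tuned_fp4_moe_kernel",
}

def select_moe_kernel(shape: MoEShape, heuristic: Callable[[MoEShape], str]) -> str:
    """Prefer a hand-tuned entry; otherwise fall back to the heuristic."""
    tuned = TUNED_KERNELS.get(shape)
    return tuned if tuned is not None else heuristic(shape)

def generic_heuristic(shape: MoEShape) -> str:
    # Stand-in for the slow fallback path described in the text.
    return "generic_fallback_kernel"

glm_shape = MoEShape(6144, 2048, 256, 8)
print(select_moe_kernel(glm_shape, generic_heuristic))
```

Shapes without an entry still go through the heuristic, so the override only changes behavior for the cases that were measured.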
Why This Matters
Wafer showed that, with careful quantization and framework tuning, AMD's MI355X can deliver competitive inference performance without custom CUDA kernels. The CUDA moat is eroding as AMD's ROCm stack and tools like Quark mature. For developers, this means cost-effective alternatives to NVIDIA exist today for production inference workloads.


