AMD MI355X Delivers 80% of B200 Performance at <50% Cost for GLM-5.2

Wafer published benchmarks showing their MI355X cluster reaching 2626 tok/s/node at 2.4 requests per second (RPS) on a 20k input / 1k output workload with a 60% cache hit rate. That is roughly 80% of the B200's 3192 tok/s at 3.0 RPS (2626 / 3192 ≈ 0.82), on hardware that costs 2.75x less per GPU. Separately, single-stream GLM-5.2 decode reached 213 tok/s on a 10k input / 1.5k output workload.
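The headline numbers can be checked with a few lines of arithmetic. This is a minimal sketch, assuming "2.75x cheaper per GPU" means the B200 price divided by the MI355X price is 2.75 and that both systems use the same number of GPUs per node; the variable names are illustrative.

```python
# Figures quoted in the article.
MI355X_TOK_S = 2626.0   # tok/s/node at 2.4 RPS
B200_TOK_S = 3192.0     # tok/s at 3.0 RPS
PRICE_RATIO = 2.75      # assumption: B200 cost / MI355X cost, per GPU

# Relative throughput of MI355X versus B200.
perf_ratio = MI355X_TOK_S / B200_TOK_S

# Relative cost of MI355X versus B200.
cost_ratio = 1.0 / PRICE_RATIO

# Throughput per unit of cost, relative to B200.
perf_per_cost = perf_ratio * PRICE_RATIO

print(f"throughput ratio: {perf_ratio:.3f}")
print(f"cost ratio:       {cost_ratio:.3f}")
print(f"perf per cost:    {perf_per_cost:.2f}x")
```

Under those assumptions the MI355X delivers a bit more than twice the throughput per unit of hardware cost, which is where the "80% of the performance at under 50% of the cost" framing comes from.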

MXFP4 Quantization with AMD Quark

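Wafer quantized GLM-5.2 from bf16 to MXFP4 using AMD Quark. Compared to z-ai's official FP8 quantization, MXFP4 showed no meaningful loss on GSM8K, GPQA-Diamond, or tau2: the GSM8K and GPQA-Diamond deltas fall within the reported error bars, and tau2 improved slightly. The eval table: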

Eval                                    | FP8 baseline   | MXFP4          | Δ
GSM8K (200q, 5-shot, greedy)            | 0.965 ± 0.013  | 0.955 ± 0.014  | -0.010
GPQA-Diamond (198q × 2 seeds, temp 1.0) | 0.9217 ± 0.027 | 0.9026 ± 0.029 | -0.019
tau2 macro                              | 0.819          | 0.834          | +0.015
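One way to read the table is to compare each delta against the combined uncertainty of the two scores. The sketch below assumes the ± values are standard errors and that the two runs are independent, neither of which the source states; tau2 is omitted because no error bars are given for it.

```python
import math

def delta_with_noise(base, base_err, new, new_err):
    """Return (new - base, combined error), assuming independent errors."""
    delta = new - base
    combined = math.sqrt(base_err ** 2 + new_err ** 2)
    return delta, combined

# (FP8 score, FP8 error, MXFP4 score, MXFP4 error) from the table.
evals = {
    "GSM8K": (0.965, 0.013, 0.955, 0.014),
    "GPQA-Diamond": (0.9217, 0.027, 0.9026, 0.029),
}

results = {}
for name, (base, base_err, new, new_err) in evals.items():
    delta, combined = delta_with_noise(base, base_err, new, new_err)
    results[name] = (delta, combined)
    within = abs(delta) <= combined
    print(f"{name}: delta={delta:+.4f}, combined error={combined:.4f}, "
          f"within noise={within}")
```

Both deltas are smaller than the combined error, so on these sample sizes the evals cannot distinguish MXFP4 from the FP8 baseline.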

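MXFP4 weights take roughly half the bytes of FP8 weights (and about a quarter of bf16), which roughly halves the weight traffic per decoded token. That matters because decode is typically memory-bandwidth-bound, even with the MI355X's 8 TB/s of HBM3E bandwidth.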
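To make the storage argument concrete, here is a minimal sketch of MXFP4-style block quantization in plain Python. It follows the general shape of the OCP Microscaling format (32 FP4 E2M1 elements sharing one 8-bit power-of-two scale) but is not Quark's implementation: rounding is simplified to nearest-value with ties going to the smaller magnitude, and all function names are illustrative.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # elements that share one scale
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6 = 1.5 * 2**2)

def quantize_block(block):
    """Quantize one block to (power-of-two scale, list of signed E2M1 values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: align the block's largest exponent with E2M1's largest.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    quantized = []
    for x in block:
        target = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to the max value
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        quantized.append(math.copysign(nearest, x))
    return scale, quantized

def dequantize_block(scale, quantized):
    """Reconstruct approximate values from the scale and the FP4 elements."""
    return [scale * q for q in quantized]

def bits_per_weight(element_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Average storage per weight, including the shared scale."""
    return element_bits + scale_bits / block_size

block = [1.0, -0.25, 0.6] + [0.0] * (BLOCK_SIZE - 3)
scale, q = quantize_block(block)
restored = dequantize_block(scale, q)
print(scale, q[:3], restored[:3])
print(bits_per_weight(), bits_per_weight() / 8, bits_per_weight() / 16)
```

With the shared scale included, each weight costs 4.25 bits on average, which is about 53% of FP8 and about 27% of bf16.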

Inference Framework: sglang Over vLLM and ATOM

vLLM had no working MXFP4 + GlmMoeDsa path. ATOM's output degraded at long context. sglang had native MXFP4 support and the least friction. Two patches were needed to enable MTP speculative decode:

  1. MTP head quantization mismatch: The MTP head's shared expert is stored in bf16 but registered under a different module prefix (model.decoder.* vs model.layers.78.mlp.shared_experts.*). Quark records un-quantized layers by name; Wafer copied the layer 78 entries under the decoder prefix sglang uses. This fixed the shape mismatch crash.

  2. CUDA-ism in multi-step kernel: The fused multi-step metadata kernel for draft depth ≥4 pulls in a CUDA-specific #include without a ROCm guard, which fails on AMD. A single #ifdef USE_ROCM guard fixed it.
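The first patch amounts to duplicating exclusion entries under a second name prefix. The sketch below shows that idea on a plain list of layer names; the list contents, the helper name, and the exact target prefix under model.decoder are assumptions for illustration, since the source does not show the Quark config layout or the full module paths.

```python
def alias_excluded_layers(excluded, src_prefix, dst_prefix):
    """Return the exclusion list with src_prefix entries also listed under dst_prefix."""
    aliased = list(excluded)
    seen = set(excluded)
    for name in excluded:
        if name.startswith(src_prefix):
            alias = dst_prefix + name[len(src_prefix):]
            if alias not in seen:  # keep the operation idempotent
                aliased.append(alias)
                seen.add(alias)
    return aliased

# Hypothetical entries; the real names come from the Quark export.
excluded = [
    "model.layers.78.mlp.shared_experts.gate_proj",
    "model.layers.78.mlp.shared_experts.down_proj",
    "lm_head",
]

patched = alias_excluded_layers(
    excluded,
    src_prefix="model.layers.78.mlp.shared_experts.",
    dst_prefix="model.decoder.mlp.shared_experts.",  # assumed target prefix
)
for name in patched:
    print(name)
```

Keeping the original entries and appending the aliases means the same config still works with a loader that uses the layer 78 names.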

With speculative decoding working, Wafer enabled --kv-cache-dtype fp8_e4m3 and --enable-aiter-allreduce-fusion to reach 213 tok/s single stream.

Prefill Optimization: TP4×DP2 and Custom MoE Kernels

At TP8, the MI355X node reached 1461 tok/s/node. Switching to TP4×DP2 raised that to 1944 tok/s/node at 2.0 RPS. However, sglang's MoE kernel was still falling back to a slow FlyDSL heuristic for fp4 shapes. Wafer manually tuned the MoE kernel selection for GLM-5.2's fp4 shapes (model_dim 6144, moe_inter 2048, E=256, topk=8), reaching 2626 tok/s/node at 2.4 RPS.
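The layout change and the size of each step can be sketched as follows. This assumes an 8-GPU node and contiguous rank assignment, which is a common convention but not something the source confirms; the helper names are illustrative.

```python
def tp_dp_groups(num_gpus, tp_size, dp_size):
    """Split GPU ranks into dp_size replicas of tp_size tensor-parallel ranks each."""
    if tp_size * dp_size != num_gpus:
        raise ValueError("tp_size * dp_size must equal num_gpus")
    return [list(range(r * tp_size, (r + 1) * tp_size)) for r in range(dp_size)]

def speedup(after, before):
    """Throughput ratio between two configurations."""
    return after / before

# Throughput figures quoted in the article, in tok/s/node.
TP8 = 1461.0
TP4_DP2 = 1944.0
TP4_DP2_TUNED_MOE = 2626.0

layout = tp_dp_groups(num_gpus=8, tp_size=4, dp_size=2)
print(layout)
print(f"TP8 -> TP4xDP2:       {speedup(TP4_DP2, TP8):.2f}x")
print(f"TP4xDP2 -> tuned MoE: {speedup(TP4_DP2_TUNED_MOE, TP4_DP2):.2f}x")
print(f"overall:              {speedup(TP4_DP2_TUNED_MOE, TP8):.2f}x")
```

Each of the two changes contributes roughly a third more throughput, and together they account for about a 1.8x gain over the TP8 starting point.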

Why This Matters

Wafer proved that with proper quantization and framework tuning, AMD's MI355X can deliver competitive inference performance without custom CUDA kernels. The CUDA moat is eroding as AMD's ROCm stack and tools like Quark mature. For developers, this means cost-effective alternatives to NVIDIA exist today for production inference workloads.