AMD MI355X Proves Competitive for Frontier AI Inference at 2.75x Lower Cost Than Blackwell
Key Takeaways
- ▸The AMD MI355X delivers 80% of NVIDIA B200 performance at 2.75x lower cost per GPU, making it a cost-effective option for frontier-model inference
- ▸MXFP4 quantization and speculative decoding can deliver near-3x single-stream throughput gains, showing that software optimization can close much of a hardware performance gap
- ▸AMD's ROCm ecosystem still demands more engineering effort than NVIDIA's stack, which benefits from day-0 model support, but it is maturing for production inference workloads
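The price-performance implied by the first takeaway can be worked out directly. The following is a minimal sketch, assuming both ratios are measured against the same NVIDIA baseline GPU (the article cites the B200 for performance and the B300 for cost, so the result should be read as approximate). The function name is illustrative.

```python
def perf_per_dollar_ratio(relative_perf: float, cost_advantage: float) -> float:
    """Performance per dollar relative to a baseline GPU.

    relative_perf: throughput as a fraction of the baseline (0.80 means 80%).
    cost_advantage: how many times cheaper the GPU is (2.75 means 2.75x lower cost).
    """
    if relative_perf <= 0 or cost_advantage <= 0:
        raise ValueError("both ratios must be positive")
    # (relative_perf * P) / (C / cost_advantage), normalised by the baseline P / C
    return relative_perf * cost_advantage


# Figures quoted in the article; the shared baseline is an assumption.
ratio = perf_per_dollar_ratio(0.80, 2.75)
print(f"{ratio:.2f}x performance per dollar vs. baseline")
```

Under that assumption, 80% of the performance at 1/2.75 of the cost works out to roughly 2.2x the throughput per dollar.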
Summary
Wafer has demonstrated that AMD's MI355X GPU can serve Zhipu AI's GLM5.2 frontier language model with competitive performance at significantly lower cost than NVIDIA's Blackwell. The optimized deployment reached 2,626 tokens per second per node on a production-scale workload with defined latency targets, while costing 2.75x less per GPU than NVIDIA's B300. The result suggests that AMD's hardware is emerging as a genuine alternative for large-scale AI inference serving, despite NVIDIA's historical advantages in software and day-0 model support.
The engineering effort required to achieve competitive performance reveals both the challenge and the opportunity in AMD's ROCm ecosystem. Engineers quantized GLM5.2 to MXFP4 with AMD's Quark tool and served it on the MI355X using the SGLang inference framework. The critical optimizations were implementing speculative decoding and fixing a compatibility issue between the quantization layer naming conventions and the model's multi-token prediction heads. These fixes required only targeted code changes, but they were essential for unlocking near-3x single-stream throughput gains.
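To illustrate why speculative decoding raises single-stream throughput, here is a minimal sketch of the greedy variant: a cheap draft proposes several tokens, the target model checks them, and the longest agreeing prefix is kept along with one token from the target. The `draft_next` and `target_next` callables are hypothetical toy stand-ins, not SGLang APIs. In the setup described above, the multi-token prediction heads presumably play the draft role, but that mapping is an assumption. A real server also verifies all drafted positions in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token id


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-then-verify step (greedy variant)."""
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify: keep drafted tokens for as long as the target agrees.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's correction ends the step
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted, so the verification also yields a bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[int], draft_next: NextToken, target_next: NextToken,
             max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        steps += 1
    return out[:len(prompt) + max_new], steps


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after token 7.
def target_next(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10


tokens, steps = generate([0], draft_next, target_next, max_new=12)
print(tokens, steps)
```

In this toy run the output is identical to plain greedy decoding with the target model, but it takes 3 verification steps instead of 12 sequential target passes. The gain in practice depends on how often the draft agrees with the target.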
This result carries significant implications for AI infrastructure economics. With frontier models being released every two weeks and NVIDIA GPU scarcity driving token prices higher, AMD's lower-cost hardware offers a compelling alternative at comparable performance. The work demonstrates that optimization techniques and maturing open-source frameworks are rapidly closing the gap that once strongly favored NVIDIA, potentially accelerating hardware competition in the AI inference market.
- ▸Cost-effective alternatives to NVIDIA are emerging as AI token demand skyrockets, potentially reshaping infrastructure economics for AI service providers
Editorial Opinion
AMD's emergence in AI inference is a reminder that hardware competition isn't won on silicon alone; it is won through the combined force of hardware, software, and optimization tooling. This benchmark is significant not because AMD now beats NVIDIA (it doesn't), but because it shows the performance gap can be closed through engineering and is not a fundamental silicon limitation. As optimization frameworks mature and engineers continue improving kernel support, we should expect frontier models to run efficiently on AMD hardware by default. This competition is healthy and will sharpen both companies' focus on cost-efficiency, ultimately benefiting inference providers racing to scale AI services.



