Comprehensive WebGPU LLM Inference Benchmark Reveals Dispatch Overhead Challenges Across GPU Vendors and Browsers
Key Takeaways
- WebGPU dispatch overhead varies significantly by backend (Vulkan vs. Metal) and implementation, with dispatch costs representing the primary performance bottleneck for batch-size-1 inference
- Naive single-operation benchmarks overestimate true dispatch costs by ~20×, highlighting the importance of sequential-dispatch methodology for accurate performance characterization
- Kernel fusion strategies show backend-dependent effectiveness: a 53% throughput improvement on Vulkan but no benefit on CUDA, indicating that optimization must be tailored to the underlying graphics API
Summary
A new research paper presents the first systematic characterization of WebGPU dispatch overhead for large language model inference, testing across four GPU vendors (NVIDIA, AMD, Apple, Intel), multiple backends and browsers, and two model sizes. The study reveals that naive benchmarks overestimate dispatch costs by approximately 20×, with actual per-dispatch API overhead ranging from 24–36 microseconds on Vulkan to 32–71 microseconds on Metal. Researchers developed torch-webgpu, a PyTorch backend for WebGPU, achieving 11–12% of CUDA performance on their reference platform. The findings demonstrate that per-operation overhead dominates performance at batch size 1, with kernel fusion improving Vulkan throughput by 53% while providing no benefit on CUDA, confirming dispatch overhead as a critical differentiator in WebGPU optimization.
Cross-platform testing across Windows, macOS, and Linux also reveals substantial performance variation within the same backend, including a 2.2× difference between Metal implementations, suggesting that implementation quality is critical.
Editorial Opinion
This research provides valuable empirical data for developers targeting browser-based LLM inference, a growing use case for edge AI deployment. However, the finding that WebGPU achieves only 11–12% of CUDA performance raises questions about whether browser-based inference is practical for performance-critical applications without significant architectural changes. The open-source release of benchmarks and code will benefit the community, though the results suggest that reducing dispatch overhead through WebGPU specification refinements may be necessary for broader adoption in inference workloads.

