WarpSpeed Achieves 2.24x Speedup on NVIDIA's Blackwell Kernel Benchmark
Key Takeaways
- ▸WarpSpeed beats NVIDIA's optimized PyTorch baselines on 90% of SOL-ExecBench kernels with 2.24x average speedup after one day of search
- ▸Significantly outperforms previous benchmark leader Cursor (90% vs 63% win rate, 2.24x vs 1.38x speedup) while using 1/21st the optimization time
- ▸Strongest results on quantization kernels (NVFP4, FP8), with up to 14.9x speedup on production-critical attention mechanisms
Summary
doubleAI's WarpSpeed, an AI-powered performance engineering system, has achieved remarkable results on NVIDIA's SOL-ExecBench—a benchmark comprising 235 production CUDA kernels from real models including DeepSeek, Qwen, Gemma, and Stable Diffusion. After just one day of optimization search, WarpSpeed beat NVIDIA's own optimized PyTorch baselines on 90% of kernels with an average speedup of 2.24x, significantly outperforming the prior benchmark leader Cursor, which achieved a 63% win rate with 1.38x speedup after three weeks of optimization.
The system demonstrated particularly strong performance on quantization kernels, the core of modern efficient inference. Most impressively, WarpSpeed achieved a 14.9x speedup on an NVFP4 grouped-query attention kernel, described as running "essentially at the speed of light" for this workload. WarpSpeed delivered consistent gains across all four benchmark problem sets (atomic operations, fused blocks, low-precision kernels, and inference primitives), demonstrating broad applicability.
DoubleAI emphasizes verification as paramount, with a verification framework designed to prevent "reward hacking" and ensure correctness alongside speed. The results highlight how agentic AI systems can optimize specialized hardware workloads more efficiently than human experts, with important implications for reducing inference costs in production environments.
- Verification framework ensures correctness alongside performance, preventing optimization-induced bugs in production kernels
Editorial Opinion
WarpSpeed's results demonstrate a meaningful leap forward in AI-driven performance engineering, showing that agentic systems can out-optimize hand-tuned kernels on specialized hardware in a fraction of the time. The ability to achieve 2.24x speedups on NVIDIA's own baselines—beating them on 90% of problems after just one day—is genuinely impressive and could unlock significant inference cost reductions at scale. The emphasis on verification alongside performance is particularly noteworthy; in production systems, a fast kernel that silently produces wrong results is worse than useless. If these results hold up under real-world deployment, this could reshape how efficiently modern large language models and other neural networks run in inference.



