doubleAI's WarpSpeed Shatters GPU Kernel Benchmark, Vastly Outperforming Cursor
Key Takeaways
- ▸WarpSpeed beat the baseline on 90% of 235 Blackwell kernels with a 2.24× average speedup after a single day of search, compared with Cursor's 63% and 1.38× after three weeks
- ▸Exceptional results on quantization kernels (FP8, NVFP4), with select kernels achieving up to 14.9× speedup, an area critical for modern LLM inference
- ▸doubleAI prioritizes correctness through rigorous verification frameworks that prevent reward hacking and ensure real-world reliability
Summary
doubleAI announced that its WarpSpeed artificial expert intelligence system achieved breakthrough results on NVIDIA's SOL-ExecBench, a benchmark of 235 of the hardest CUDA kernels from production models. Running for just a single day, WarpSpeed beat NVIDIA's optimized PyTorch baselines on 90% of the problems, achieving an average speedup of 2.24×.
The result far exceeds the benchmark numbers Cursor announced in April 2026. Cursor's multi-agent system needed three weeks of computation to beat the baseline on 63% of problems, with a 1.38× average speedup. WarpSpeed posted better results on all four problem sets (atomic single-op kernels, fused multi-op blocks, quantization kernels, and inference primitives) in a fraction of the time.
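The two headline figures, the share of problems where the baseline was beaten and the average speedup, can be derived from per-kernel timings. The sketch below is illustrative only: the announcement does not say how the average is computed (arithmetic or geometric mean, over all kernels or only the wins), so both means are shown, and the function name and timings are hypothetical.

```python
from statistics import fmean, geometric_mean


def summarize(baseline_ms, candidate_ms):
    """Return (win_rate, arithmetic-mean speedup, geometric-mean speedup).

    Speedup for one kernel is baseline time divided by candidate time,
    so a value above 1.0 means the candidate beat the baseline.
    """
    if not baseline_ms or len(baseline_ms) != len(candidate_ms):
        raise ValueError("need two non-empty timing lists of equal length")
    speedups = [b / c for b, c in zip(baseline_ms, candidate_ms)]
    win_rate = sum(s > 1.0 for s in speedups) / len(speedups)
    return win_rate, fmean(speedups), geometric_mean(speedups)


# Hypothetical timings in milliseconds (not taken from the benchmark).
baseline = [4.0, 2.0, 9.0, 1.0]
candidate = [2.0, 1.0, 3.0, 2.0]

win_rate, mean_speedup, geo_speedup = summarize(baseline, candidate)
print(f"beat baseline on {win_rate:.0%} of kernels")
print(f"arithmetic mean speedup: {mean_speedup:.3f}x")
print(f"geometric mean speedup:  {geo_speedup:.3f}x")
```

The two means can differ noticeably when a few kernels see very large speedups, which is why a reported average is hard to compare across systems without knowing which one was used.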
Performance was strongest on quantization kernels (FP8 and NVFP4 attention), where some kernels ran as much as 14.9× faster than the optimized reference baseline. doubleAI stresses that correctness comes first: the company treats its evaluation harness and verification framework as critical safeguards against 'reward hacking', in which a system produces fast but incorrect kernels.
Gains were consistent across all four benchmark categories (L1, L2, Quant, FlashInfer-Bench), demonstrating broad applicability to production workloads.
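To make the reward-hacking concern concrete, here is a minimal sketch of a correctness gate, assuming a harness that compares a candidate kernel's output with a reference output before crediting any speedup. This is not doubleAI's verification framework, whose details are not described in the announcement; the function names, tolerances, and data are hypothetical.

```python
import math


def outputs_match(candidate, reference, rel_tol=1e-3, abs_tol=1e-5):
    """Element-wise comparison within a tolerance (tolerances are assumed).

    math.isclose returns False for NaN, so NaN outputs never pass.
    """
    if len(candidate) != len(reference):
        return False
    return all(
        math.isclose(c, r, rel_tol=rel_tol, abs_tol=abs_tol)
        for c, r in zip(candidate, reference)
    )


def score_kernel(candidate_out, reference_out, baseline_ms, candidate_ms):
    """Credit a speedup only when the output is correct; otherwise score 0.0."""
    if not outputs_match(candidate_out, reference_out):
        return 0.0
    return baseline_ms / candidate_ms


reference = [1.0, 2.0, 3.0]

# A correct kernel that runs twice as fast earns its speedup.
honest = score_kernel([1.0, 2.0, 3.0000001], reference, 10.0, 5.0)

# A kernel that is 100x faster but returns wrong values earns nothing.
cheating = score_kernel([0.0, 0.0, 0.0], reference, 10.0, 0.1)

print(honest, cheating)
```

A gate like this only catches errors on the inputs it is given, so a real harness would presumably also vary input shapes and values so that a kernel cannot pass by memorizing a fixed test case.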
Editorial Opinion
WarpSpeed's results represent a watershed moment in automated kernel optimization, demonstrating that an AI-driven system can now surpass weeks of multi-agent search in a single day. The dramatic improvements on quantization kernels underscore the growing importance of specialized optimization in modern inference, a domain where hand-crafted engineering has traditionally been the only path to peak performance. By coupling aggressive optimization with a verification-first methodology, doubleAI has set a new competitive standard that will likely reshape GPU kernel engineering practices across the industry.



