BotBeat
...
← Back

> ▌

DoubleAIDoubleAI
RESEARCHDoubleAI2026-05-27

WarpSpeed Achieves 2.24x Speedup on NVIDIA's Blackwell Kernel Benchmark

Key Takeaways

  • ▸WarpSpeed beats NVIDIA's optimized PyTorch baselines on 90% of SOL-ExecBench kernels with 2.24x average speedup after one day of search
  • ▸Significantly outperforms previous benchmark leader Cursor (90% vs 63% win rate, 2.24x vs 1.38x speedup) while using 1/21st the optimization time
  • ▸Strongest results on quantization kernels (NVFP4, FP8), with up to 14.9x speedup on production-critical attention mechanisms
Source:
Hacker Newshttps://www.doubleai.com/research/warpspeed-approaches-speed-of-light-on-blackwell↗

Summary

doubleAI's WarpSpeed, an AI-powered performance engineering system, has achieved remarkable results on NVIDIA's SOL-ExecBench—a benchmark comprising 235 production CUDA kernels from real models including DeepSeek, Qwen, Gemma, and Stable Diffusion. After just one day of optimization search, WarpSpeed beat NVIDIA's own optimized PyTorch baselines on 90% of kernels with an average speedup of 2.24x, significantly outperforming the prior benchmark leader Cursor, which achieved a 63% win rate with 1.38x speedup after three weeks of optimization.

The system demonstrated particularly strong performance on quantization kernels, the core of modern efficient inference. Most impressively, WarpSpeed achieved a 14.9x speedup on an NVFP4 grouped-query attention kernel, described as running "essentially at the speed of light" for this workload. WarpSpeed delivered consistent gains across all four benchmark problem sets (atomic operations, fused blocks, low-precision kernels, and inference primitives), demonstrating broad applicability.

DoubleAI emphasizes verification as paramount, with a verification framework designed to prevent "reward hacking" and ensure correctness alongside speed. The results highlight how agentic AI systems can optimize specialized hardware workloads more efficiently than human experts, with important implications for reducing inference costs in production environments.

  • Verification framework ensures correctness alongside performance, preventing optimization-induced bugs in production kernels

Editorial Opinion

WarpSpeed's results demonstrate a meaningful leap forward in AI-driven performance engineering, showing that agentic systems can out-optimize hand-tuned kernels on specialized hardware in a fraction of the time. The ability to achieve 2.24x speedups on NVIDIA's own baselines—beating them on 90% of problems after just one day—is genuinely impressive and could unlock significant inference cost reductions at scale. The emphasis on verification alongside performance is particularly noteworthy; in production systems, a fast kernel that silently produces wrong results is worse than useless. If these results hold up under real-world deployment, this could reshape how efficiently modern large language models and other neural networks run in inference.

Machine LearningMLOps & InfrastructureAI HardwareScience & Research

More from DoubleAI

DoubleAIDoubleAI
RESEARCH

doubleAI's WarpSpeed Shatters GPU Kernel Benchmark, Vastly Outperforming Cursor

2026-05-24
DoubleAIDoubleAI
PRODUCT LAUNCH

DoubleAI's WarpSpeed Achieves Up to 100x Speedup on NVIDIA's cuGraph Library Using AI-Powered Optimization

2026-03-02

Comments

Suggested

Research CommunityResearch Community
RESEARCH

FuzzingBrain V2: Multi-Agent LLM System Discovers 29 Zero-Day Vulnerabilities with 90% Detection Rate

2026-05-27
PostHogPostHog
PRODUCT LAUNCH

PostHog Plans to Train AI Models on Customer Data to Power New Product Intelligence Features

2026-05-27
Google / AlphabetGoogle / Alphabet
PRODUCT LAUNCH

Google Rebrands Vertex AI as Gemini Enterprise Agent Platform

2026-05-27
← Back to news
© 2026 BotBeat
AboutPrivacy PolicyTerms of ServiceContact Us