K-Search: New AI Framework Achieves Up to 14.3x Speedup in GPU Kernel Optimization

Key Takeaways

▸K-Search uses a co-evolving world model to guide LLMs in optimizing GPU kernels, achieving up to 14.3x speedup over state-of-the-art evolutionary search on MoE kernels
▸The framework decouples high-level algorithmic planning from low-level implementation, enabling exploration of non-monotonic optimization paths
▸K-Search outperforms both state-of-the-art automated methods and human-designed solutions on the GPUMode TriMul benchmark for H100 GPUs

Source:

Hacker News: https://arxiv.org/abs/2602.19128

Summary

Researchers from UC Berkeley and affiliates have introduced K-Search, a framework that uses large language models (LLMs) to optimize GPU kernels automatically. The system targets a bottleneck in machine learning infrastructure by treating LLMs not merely as code generators, but as strategic planners that can navigate complex optimization paths. Traditional automated approaches struggle with multi-step structural transformations and often discard promising strategies because of temporary implementation flaws.

K-Search's innovation lies in its "co-evolving world model", which replaces static search heuristics with dynamic, LLM-guided exploration. This approach explicitly separates high-level algorithmic planning from low-level code implementation. That separation lets the system pursue non-monotonic optimization paths, in which an intermediate step may be slower or temporarily broken before a later step pays off, while remaining resilient to intermediate bugs or inefficiencies. The framework uses the domain knowledge encoded in LLMs to actively explore the optimization space rather than relying on rigid heuristics.
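The separation of planning from implementation can be illustrated with a toy search loop. This is a minimal sketch under stated assumptions, not K-Search's actual algorithm: the `Plan` class, the `search` function, the scoring rule, and the canned results are all hypothetical, and a lookup table stands in for the LLM that would write and benchmark real kernels.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    """A high-level optimization idea, tracked separately from any one implementation."""
    name: str
    estimate: float           # current belief about the speedup this plan can reach
    attempts: int = 0
    best_speedup: float = 0.0

def search(plans, implement, max_attempts=3, budget=10):
    """Repeatedly try the plan with the highest estimate and update that estimate.

    `implement(name, attempt)` returns a measured speedup, or None if the
    attempt failed to build or run. A failure lowers the estimate slightly
    but does not discard the plan.
    """
    best = None
    for _ in range(budget):
        open_plans = [p for p in plans if p.attempts < max_attempts]
        if not open_plans:
            break
        plan = max(open_plans, key=lambda p: p.estimate)
        result = implement(plan.name, plan.attempts)
        plan.attempts += 1
        if result is None:
            plan.estimate *= 0.9   # mild penalty for a broken attempt
            continue
        plan.best_speedup = max(plan.best_speedup, result)
        # Blend the prior belief with the new measurement (assumed update rule).
        plan.estimate = 0.5 * plan.estimate + 0.5 * result
        if best is None or result > best[1]:
            best = (plan.name, result)
    return best

# Hypothetical canned outcomes: "fuse_ops" fails, then regresses, then wins.
TOY_RESULTS = {
    "fuse_ops": [None, 0.8, 2.5],
    "tile_only": [1.2, 1.2, 1.2],
}

def toy_implement(plan_name, attempt):
    """Stand-in for an LLM writing a kernel and a benchmark measuring it."""
    return TOY_RESULTS[plan_name][attempt]

def make_plans():
    return [Plan("fuse_ops", estimate=3.0), Plan("tile_only", estimate=1.5)]

if __name__ == "__main__":
    print(search(make_plans(), toy_implement))
```

In this toy run, a loop that kept only the best measured kernel so far would abandon `fuse_ops` after its failed build and its 0.8x regression. Because the plan keeps its own estimate, it survives long enough to reach its third, fastest implementation.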

Benchmark results show substantial performance gains across diverse kernel types. On complex kernels from FlashInfer, including Grouped Query Attention (GQA), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE), K-Search achieved an average 2.10x improvement over state-of-the-art evolutionary search methods, with peak gains reaching 14.3x on MoE kernels. The system also achieved state-of-the-art performance on the GPUMode TriMul task for H100 GPUs, reaching 1030 microseconds and surpassing both previous automated solutions and human-designed implementations.
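For readers unfamiliar with how such figures are derived, a speedup is the baseline latency divided by the optimized latency, and an average over kernels can sit far below the peak. The sketch below uses made-up latencies, not numbers from the paper, and it shows both an arithmetic and a geometric mean only as an assumption, since the summary does not say how the 2.10x average was aggregated.

```python
import math

def speedup(baseline_us, optimized_us):
    """Speedup factor: how many times faster the optimized kernel runs."""
    return baseline_us / optimized_us

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Mean of ratios that is less dominated by a single large outlier."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (baseline, optimized) latencies in microseconds.
latency_pairs = [(200.0, 100.0), (90.0, 60.0), (400.0, 100.0)]
speedups = [speedup(b, o) for b, o in latency_pairs]

if __name__ == "__main__":
    print("per-kernel:", speedups)
    print("arithmetic mean:", arithmetic_mean(speedups))
    print("geometric mean:", geometric_mean(speedups))
```

With these illustrative inputs the largest single speedup is 4.0x, while both averages are lower, which is the same relationship as a 14.3x peak alongside a 2.10x average.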

This research represents a significant step toward automating kernel optimization, a traditionally expert-intensive process that becomes increasingly critical as GPU architectures evolve rapidly. By enabling LLMs to reason about optimization strategies at a higher level of abstraction, K-Search could accelerate development cycles for machine learning systems and reduce the expertise barrier for achieving peak hardware performance.

The approach addresses a critical limitation of existing methods, which treat LLMs as simple code generators within heuristic-guided loops.

Editorial Opinion

K-Search represents a fascinating evolution in how we leverage LLMs for systems optimization, moving from pattern-matching code generation to strategic reasoning about performance trade-offs. The ability to maintain promising optimization trajectories despite temporary implementation failures mirrors how human experts approach kernel tuning, and it hints at more capable automated optimization. The gains of up to 14.3x on complex kernels are more than incremental improvements; they suggest the framework is finding optimization strategies that differ from those reached by evolutionary search and human intuition, which could reshape how we think about the boundary between automated and expert-driven performance engineering.


