Stanford Researchers Develop Multi-Agent AI System to Improve HIP Kernel Generation for AMD GPUs
Key Takeaways
- Stanford researchers developed a multi-agent AI system that combines synthetic data generation, evolutionary optimization, and GRPO reinforcement learning to improve HIP kernel generation for AMD GPUs. It targets a gap in which LLMs generate CUDA well but struggle with AMD's HIP
- The approach trains the open-source Qwen2.5-Coder-14B model and uses Google's Gemini-2.5-Flash for synthetic data generation, offering a lower-cost, accessible alternative to proprietary solutions for kernel optimization
- Results show improved compilation and correctness rates, but the researchers emphasize that meaningful speedups still require deeper hardware awareness and optimization reasoning than the current approach provides
Summary
The Scaling Intelligence Lab at Stanford University has developed a new approach to generating HIP (Heterogeneous-computing Interface for Portability) kernels for AMD GPUs using large language models, synthetic data, and reinforcement learning. The research addresses an ecosystem imbalance: LLMs generate high-quality CUDA kernels for NVIDIA GPUs but frequently struggle with AMD's HIP, producing hallucinated APIs and kernels that fail at compile time. The team created a synthetic dataset of 500 PyTorch reference tasks using mutation, composition, and constraint-based generation. They then built a multi-agent optimization pipeline with specialized agents for task generation, PyTorch-to-HIP translation, hardware evaluation, and evolutionary optimization. Finally, they trained an open-source Qwen2.5-Coder-14B model using supervised fine-tuning (SFT) followed by GRPO (Group Relative Policy Optimization) reinforcement learning, which directly rewards correctness and speedup measured on AMD MI350X GPUs.
The results showed improved compilation and correctness rates across all KernelBench levels, with reinforcement learning providing the strongest gains. However, the researchers noted that achieving meaningful speedups over PyTorch still requires deeper hardware awareness and optimization reasoning. Within the multi-agent pipeline, Google's Gemini-2.5-Flash model generates diverse, verified kernel tasks, an example of one LLM supplying training data for another on a complex code generation problem.
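To make the GRPO step described above more concrete, the following is a minimal Python sketch of how a group of sampled kernels for one task could be scored and turned into group-relative advantages. The reward values, the function names `kernel_reward` and `group_relative_advantages`, and the use of a population standard deviation are illustrative assumptions, not details taken from the Stanford work.

```python
from statistics import mean, pstdev


def kernel_reward(compiled: bool, correct: bool, speedup: float) -> float:
    """Hypothetical reward shaping (an assumption, not the researchers' formula).

    A kernel that fails to compile earns nothing, one that compiles but gives
    wrong output earns a small amount, and a correct kernel earns a base reward
    plus a bonus for any speedup over the PyTorch reference (speedup > 1.0).
    """
    if not compiled:
        return 0.0
    if not correct:
        return 0.1
    return 1.0 + max(0.0, speedup - 1.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and spread of its own group.

    This is the core idea of Group Relative Policy Optimization: samples are
    compared with sibling samples for the same prompt instead of with a
    learned value function.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption here
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical samples for one task: a compile failure,
    # an incorrect kernel, and a correct kernel that runs 1.5x faster.
    rewards = [
        kernel_reward(False, False, 0.0),
        kernel_reward(True, False, 0.0),
        kernel_reward(True, True, 1.5),
    ]
    print(group_relative_advantages(rewards))
```

In this sketch, only kernels that beat their siblings receive a positive advantage, so a group in which every sample fails to compile yields no learning signal. That may be one reason a supervised fine-tuning stage precedes reinforcement learning in pipelines of this kind.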
Editorial Opinion
This research addresses a real limitation in the AI accelerator ecosystem: LLMs generate fluent CUDA but struggle with AMD's HIP language, creating a productivity gap as AMD GPUs gain market adoption. The multi-agent approach is innovative, combining synthetic data, evolutionary search, and RL to systematically improve both correctness and performance on hardware. However, the authors' own conclusion that deeper hardware awareness is still needed suggests that general-purpose LLMs may be reaching their optimization limits without more specialized architectural innovations or tighter integration with hardware profilers.



