AutoMegaKernel: Claude Code Autonomously Compiles LLMs Into Provably-Correct GPU Megakernels
Key Takeaways
- Claude Code serves as an autonomous agent for GPU kernel synthesis, automatically designing provably-correct megakernels through a formal pipeline including deadlock/race checking, verification against eager execution, and self-tuning across unattended runs
- AutoMegaKernel's int8 auto-tuned kernel outperforms CUDA-graphed cuBLAS on inference-class GPUs, with the performance advantage derived from reduced weight memory bandwidth (8-bit vs. 16-bit weights) rather than algorithmic innovation
- The entire LLM forward pass compiles into a single persistent kernel launch with cross-SM synchronization, achieving numerical accuracy matching PyTorch to fp32 precision across different GPU architectures
Summary
RightNow-AI has released AutoMegaKernel, an open-source system that uses Claude Code as an autonomous agent to compile entire LLM forward passes into single, optimized GPU megakernels with formal correctness guarantees. The system performs a complete pipeline of importing, lowering, validation (deadlock and race-condition checking), verification against eager PyTorch execution, and GPU kernel generation, then autonomously self-tunes the kernel across successive runs. The initial release targets HuggingFace Llama models on NVIDIA CUDA GPUs (compute capabilities sm_75 to sm_120).
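The deadlock-checking stage of the pipeline can be illustrated with a small sketch. It assumes the megakernel's work is modeled as tasks that wait on other tasks, so a cycle in the wait-for graph means some tasks can never start. The task names and the `find_deadlock` helper are hypothetical and are not taken from AutoMegaKernel; the check itself is ordinary topological sorting (Kahn's algorithm).

```python
from collections import deque


def find_deadlock(waits_on):
    """Return the set of tasks that can never run, given a wait-for graph.

    waits_on maps each task name to the tasks it must wait for. Every task
    named as a dependency must also appear as a key.
    """
    # Count unmet dependencies and build the reverse (who-waits-on-me) map.
    pending = {task: len(set(deps)) for task, deps in waits_on.items()}
    dependents = {task: [] for task in waits_on}
    for task, deps in waits_on.items():
        for dep in set(deps):
            dependents[dep].append(task)

    # Kahn's algorithm: repeatedly retire tasks with no unmet dependencies.
    ready = deque(task for task, count in pending.items() if count == 0)
    retired = set()
    while ready:
        task = ready.popleft()
        retired.add(task)
        for waiter in dependents[task]:
            pending[waiter] -= 1
            if pending[waiter] == 0:
                ready.append(waiter)

    # Anything never retired is in, or downstream of, a wait cycle.
    return set(waits_on) - retired


# Hypothetical task names for one transformer layer.
ok_schedule = {
    "rmsnorm": [],
    "qkv_proj": ["rmsnorm"],
    "attention": ["qkv_proj"],
    "out_proj": ["attention"],
}
# Same layer, but the first task wrongly waits on the last one.
bad_schedule = dict(ok_schedule, rmsnorm=["out_proj"])

print("stuck tasks (ok):", sorted(find_deadlock(ok_schedule)))
print("stuck tasks (bad):", sorted(find_deadlock(bad_schedule)))
```

An empty result means every task can eventually run. A real validator for a persistent kernel would also need to account for how tasks are assigned to SMs, which this sketch ignores.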
Results show AutoMegaKernel's auto-tuned int8 (W8A16, near-lossless quantization) megakernel outperforms CUDA-graphed cuBLAS bf16 across inference-class GPUs including the L40S (864 GB/s) and A10G (600 GB/s). The performance advantage is driven by the int8 path reading half the weight bytes compared to bf16 approaches, allowing smaller models to better amortize fixed cross-SM synchronization costs. The authors acknowledge their bf16 implementation currently trails cuBLAS by approximately 1.24× and do not claim parity at equal precision.
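The bandwidth argument can be made concrete with a rough calculation. The sketch below assumes a hypothetical 1B-parameter model and a memory-bound decode step in which every weight is read from GPU memory once per token; it ignores activations, the KV cache, and synchronization overhead, so the numbers are lower bounds and not measured results. The bandwidth figures are the ones quoted above.

```python
def weight_read_time_ms(num_params, bytes_per_weight, bandwidth_gb_per_s):
    """Lower bound on the time to stream all weights from GPU memory once."""
    total_bytes = num_params * bytes_per_weight
    seconds = total_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1e3


NUM_PARAMS = 1_000_000_000  # hypothetical 1B-parameter model (assumption)
GPUS = {"L40S": 864, "A10G": 600}  # memory bandwidth in GB/s, as quoted above

for name, bandwidth in GPUS.items():
    bf16_ms = weight_read_time_ms(NUM_PARAMS, 2, bandwidth)  # 2 bytes per weight
    int8_ms = weight_read_time_ms(NUM_PARAMS, 1, bandwidth)  # 1 byte per weight
    print(f"{name}: bf16 >= {bf16_ms:.2f} ms, int8 >= {int8_ms:.2f} ms per token")
```

Under these assumptions the int8 bound is exactly half the bf16 bound on any GPU, which is the source of the advantage described above.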
The system runs the entire forward pass as a single persistent kernel launch with thread-block-level synchronization across SMs. Its output matches PyTorch's eager execution to about 1e-7 in fp32 and to within bf16 tolerance in bf16. AutoMegaKernel automatically self-retargets across GPU architectures from identical source code, with performance validated on A100, H100, and RTX 5090 hardware. The tool is available as open source under the MIT license.
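Verification against a reference output can be sketched as a maximum-absolute-error check with a per-precision threshold. The thresholds, the sample values, and the function names below are illustrative assumptions, not AutoMegaKernel's actual verifier or tolerances; plain Python lists stand in for tensors to keep the example self-contained.

```python
def max_abs_error(candidate, reference):
    """Largest element-wise absolute difference between two outputs."""
    if len(candidate) != len(reference):
        raise ValueError("outputs must have the same length")
    return max((abs(c - r) for c, r in zip(candidate, reference)), default=0.0)


# Assumed thresholds for illustration only.
TOLERANCE = {"fp32": 1e-6, "bf16": 1e-2}


def matches_reference(candidate, reference, dtype):
    """True if the candidate output is within the assumed tolerance for dtype."""
    return max_abs_error(candidate, reference) <= TOLERANCE[dtype]


reference = [0.25, -1.5, 3.0]               # stand-in for eager-mode logits
close = [0.25 + 1e-7, -1.5, 3.0 - 1e-7]     # fp32-level rounding noise
coarse = [0.25 + 5e-3, -1.5, 3.0]           # bf16-level rounding noise

print("close  vs fp32 tolerance:", matches_reference(close, reference, "fp32"))
print("coarse vs fp32 tolerance:", matches_reference(coarse, reference, "fp32"))
print("coarse vs bf16 tolerance:", matches_reference(coarse, reference, "bf16"))
```

A looser threshold for bf16 reflects its shorter mantissa; a check that used the fp32 threshold for both precisions would reject correct bf16 kernels.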
- The system automatically self-retargets across GPU compute capabilities (sm_75 to sm_120) from identical source code, with validated performance on A100, H100, and RTX 5090 hardware
- Currently limited to Llama models, and the bf16 path does not yet reach performance parity with cuBLAS; practical value depends on whether the int8 gains generalize beyond quantized models
Editorial Opinion
AutoMegaKernel represents a sophisticated application of Claude Code as a systems-level agent, embedding formal correctness guarantees into the GPU kernel synthesis pipeline. The int8 performance wins are meaningful for quantized inference workloads, but the authors' candor about trailing cuBLAS at bf16 appropriately sets expectations: the current wins depend on 8-bit quantization. The self-retargeting capability and provable correctness address real pain points in GPU development, though broader production applicability hinges on whether planned bf16 optimizations can match cuBLAS at equal precision.


