ThunderKittens 2.0 Released: Major Refactor Delivers Faster GPU Kernels and State-of-the-Art Performance
Key Takeaways
- ThunderKittens 2.0 focuses on optimization through simplification, removing unnecessary memory instructions and assembler inefficiencies rather than simply adding new features
- New support for the MXFP8 and NVFP4 data types and improved scheduling enable state-of-the-art performance that matches or exceeds cuBLAS on NVIDIA B200 GPUs
- A technical deep dive surfaces GPU optimization insights, including memory consistency patterns, tensor core pipelining behavior, and occupancy limitations that are not well documented by hardware vendors
Summary
ThunderKittens 2.0, a CUDA-embedded domain-specific language (DSL) for writing optimized GPU kernels, has been released with significant internal refactoring alongside new features. Unlike previous releases, which focused on adding capabilities, version 2.0 emphasizes optimization through subtraction: removing inefficiencies, unnecessary memory instructions, and assembler overhead. The release introduces support for the MXFP8 and NVFP4 data types, CLC scheduling, tensor memory controllability, and a simplified build structure that makes kernels easier to adapt.
The new release achieves notable performance gains, with state-of-the-art BF16, MXFP8, and NVFP4 GEMM kernels that match or exceed cuBLAS on NVIDIA B200 GPUs. The team's extensive technical analysis identifies subtle sources of inefficiency on modern NVIDIA GPUs, including findings on memory consistency patterns, tensor core pipelining behavior, PTX assembler hinting, and occupancy limitations. The refactoring also incorporated contributions from industry partners who had developed internal forks of ThunderKittens.
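For readers unfamiliar with the MX family of data types, the idea is to pair very low-precision elements with a shared power-of-two scale per small block (32 elements per block for MXFP8 in the OCP Microscaling spec; NVFP4 uses FP4 elements with smaller blocks). Below is a minimal Python sketch of block-scaled quantization to give the flavor; it is illustrative only, not ThunderKittens code, and the `to_fp8` helper is a simplified stand-in for true E4M3 rounding:

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
E4M3_EMAX = 8          # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)
BLOCK = 32             # MX block size per the OCP Microscaling spec

def to_fp8(x):
    """Simplified stand-in for E4M3 rounding: clamp, keep ~3 mantissa bits."""
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)              # spacing with 3 mantissa bits
    return round(x / step) * step

def quantize_mx_block(vals):
    """Quantize one 32-element block: shared power-of-two (E8M0-style) scale
    chosen so the block maximum lands near the top of the E4M3 range."""
    assert len(vals) == BLOCK
    amax = max(abs(v) for v in vals) or 1.0
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    return scale, [to_fp8(v / scale) for v in vals]

def dequantize_mx_block(scale, q):
    return [scale * x for x in q]

vals = [0.4375 * i for i in range(BLOCK)]
scale, q = quantize_mx_block(vals)
recon = dequantize_mx_block(scale, q)
```

The shared scale costs one byte per 32 elements, so a GEMM operand stays nearly as compact as plain FP8 while each block keeps its own dynamic range.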
ThunderKittens 2.0 updates all existing example kernels to the newer APIs, and additional state-of-the-art kernels, including Flash Attention 4, grouped GEMMs, and GEMV operations, are under active development. Together with the contributions from industry partners, the simplified build structure is designed to open high-performance kernel development to both human developers and AI agents customizing kernels for their own use cases.
Editorial Opinion
ThunderKittens 2.0 represents a mature approach to GPU kernel optimization, prioritizing ruthless efficiency over feature accumulation. The technical discoveries around memory synchronization and tensor core pipelining fill important gaps in GPU optimization documentation, benefiting the broader AI infrastructure community. By incorporating industry contributions and simplifying the development experience, this release could accelerate adoption of custom-optimized kernels across AI companies and research institutions.



