ThunderKittens 2.0 Released: Major Refactor Delivers Faster GPU Kernels and State-of-the-Art Performance
Key Takeaways
- ThunderKittens 2.0 focuses on optimization through simplification, removing unnecessary memory instructions and assembler inefficiencies rather than simply adding new features
- New support for the MXFP8 and NVFP4 data types and improved scheduling enable state-of-the-art performance that matches or exceeds cuBLAS on NVIDIA B200 GPUs
- A technical deep dive surfaces GPU optimization insights, including memory consistency patterns, tensor core pipelining behavior, and occupancy limitations that are not well documented by hardware vendors
Summary
ThunderKittens 2.0, a CUDA-embedded domain-specific language (DSL) for writing optimized GPU kernels, has been released with significant internal refactoring alongside new features. Unlike previous releases, which focused on adding capabilities, version 2.0 emphasizes optimization through subtraction: removing inefficiencies, unnecessary memory instructions, and assembler overhead. The release introduces support for the MXFP8 and NVFP4 data types, CLC scheduling, tensor memory controllability, and a simplified build structure that makes kernels easier to adapt.
The new release achieves notable performance gains, with state-of-the-art BF16, MXFP8, and NVFP4 GEMM kernels that match or exceed cuBLAS on NVIDIA B200 GPUs. The team's extensive technical analysis identifies subtle sources of inefficiency on modern NVIDIA GPUs, including findings on memory consistency patterns, tensor core pipelining behavior, PTX assembler hinting, and occupancy limitations. The refactoring also incorporated contributions from industry partners who had developed internal forks of ThunderKittens.
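For readers unfamiliar with the MX family of data types, the idea is to pair very low-precision elements with a shared power-of-two scale per small block (32 elements per block for MXFP8 in the OCP Microscaling spec; NVFP4 uses FP4 elements with smaller blocks). Below is a minimal Python sketch of block-scaled quantization to give the flavor; it is illustrative only, not ThunderKittens code, and the `to_fp8` helper is a simplified stand-in for true E4M3 rounding:

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite value representable in FP8 E4M3
E4M3_EMAX = 8          # exponent of the largest E4M3 binade (448 = 1.75 * 2**8)
BLOCK = 32             # MX block size per the OCP Microscaling spec

def to_fp8(x):
    """Simplified stand-in for E4M3 rounding: clamp, keep ~3 mantissa bits."""
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)              # spacing with 3 mantissa bits
    return round(x / step) * step

def quantize_mx_block(vals):
    """Quantize one 32-element block: shared power-of-two (E8M0-style) scale
    chosen so the block maximum lands near the top of the E4M3 range."""
    assert len(vals) == BLOCK
    amax = max(abs(v) for v in vals) or 1.0
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    return scale, [to_fp8(v / scale) for v in vals]

def dequantize_mx_block(scale, q):
    return [scale * x for x in q]

vals = [0.4375 * i for i in range(BLOCK)]
scale, q = quantize_mx_block(vals)
recon = dequantize_mx_block(scale, q)
```

The shared scale costs one byte per 32 elements, so a GEMM operand stays nearly as compact as plain FP8 while each block keeps its own dynamic range.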
ThunderKittens 2.0 updates all existing example kernels to the newer APIs, and additional state-of-the-art kernels, including Flash Attention 4, grouped GEMMs, and GEMV operations, are under active development. Together with the contributions from industry partners, the simplified build structure is designed to open high-performance kernel development to both human developers and AI agents customizing kernels for their own use cases.
Editorial Opinion
ThunderKittens 2.0 represents a mature approach to GPU kernel optimization, prioritizing ruthless efficiency over feature accumulation. The technical discoveries around memory synchronization and tensor core pipelining fill important gaps in GPU optimization documentation, benefiting the broader AI infrastructure community. By incorporating industry contributions and simplifying the development experience, this release could accelerate adoption of custom-optimized kernels across AI companies and research institutions.



