BotBeat

ThunderKittens (Open Source Project)
PRODUCT LAUNCH · 2026-03-17

ThunderKittens 2.0 Released: Major Refactor Delivers Faster GPU Kernels and State-of-the-Art Performance

Key Takeaways

  • ThunderKittens 2.0 focuses on optimization through simplification, reducing unnecessary memory instructions and assembler inefficiencies rather than purely adding new features
  • New support for MXFP8/NVFP4 data types and improved scheduling enable state-of-the-art performance matching or exceeding cuBLAS on NVIDIA B200 GPUs
  • A technical deep dive reveals critical GPU optimization insights, including memory consistency patterns, tensor core pipelining, and occupancy limitations not well documented by hardware vendors
Source: Hacker News (https://hazyresearch.stanford.edu/blog/2026-02-19-tk-2)

Summary

ThunderKittens 2.0, a CUDA-embedded domain-specific language (DSL) for GPU kernel optimization, has been released with significant internal refactoring and new features. Unlike previous releases focused on adding capabilities, version 2.0 emphasizes optimization through subtraction—removing inefficiencies, unnecessary memory instructions, and assembler overhead. The release introduces support for MXFP8/NVFP4 data types, CLC scheduling, tensor memory controllability, and simplified build structures that enable easier kernel adaptation.
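The MXFP8 and NVFP4 formats mentioned above are block-scaled: each small block of elements shares one scale factor, which is what lets very narrow element types retain usable dynamic range. The sketch below is plain Python, not ThunderKittens code; it illustrates the idea using the publicly documented FP4 E2M1 value grid and a shared per-block scale, with block size and rounding simplified for illustration.

```python
# Illustrative sketch of block-scaled quantization (the idea behind
# MXFP8/NVFP4), not ThunderKittens code. Each block shares one scale,
# chosen so the block's largest magnitude maps to the top of the grid.

# Representable magnitudes of an FP4 E2M1 element (sign stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block, grid=E2M1_GRID):
    """Quantize one block with a shared scale: scale = amax / max(grid)."""
    amax = max(abs(x) for x in block)
    scale = amax / grid[-1] if amax > 0 else 1.0
    q = []
    for x in block:
        # Snap |x| / scale to the nearest representable grid point.
        target = abs(x) / scale
        nearest = min(grid, key=lambda g: abs(g - target))
        q.append(nearest if x >= 0 else -nearest)
    return q, scale

def block_quantize(values, block_size=16):
    """Split into blocks (NVFP4-style size 16), quantize, dequantize back."""
    out = []
    for i in range(0, len(values), block_size):
        q, scale = quantize_block(values[i:i + block_size])
        out.extend(v * scale for v in q)
    return out

data = [0.013, -0.4, 2.7, 5.1, -0.09, 1.2, 0.0, -3.3]
recon = block_quantize(data, block_size=8)
print("reconstructed:", recon)
print("max abs error:", max(abs(a - b) for a, b in zip(data, recon)))
```

Note how the shared scale concentrates precision around each block's own range; values already on the scaled grid round-trip exactly, while small values in a block dominated by a large outlier lose resolution.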

The new release achieves notable performance improvements, with state-of-the-art BF16/MXFP8/NVFP4 GEMM kernels that match or exceed cuBLAS performance on NVIDIA B200 GPUs. The team conducted extensive technical analysis identifying subtle inefficiencies in modern NVIDIA GPU optimization, including discoveries around memory consistency patterns, tensor core pipelining behavior, PTX assembler hinting, and occupancy limitations. The refactoring process also incorporated contributions from industry partners who had developed internal forks of ThunderKittens.
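One flavor of the occupancy limitations referenced above is resource-driven: the number of blocks resident on a streaming multiprocessor is capped by whichever of registers, shared memory, or thread slots runs out first. A rough back-of-envelope sketch, using example resource limits that are illustrative defaults rather than B200 specifications:

```python
# Back-of-envelope occupancy arithmetic, for illustration only. The default
# per-SM limits below are example values, not B200 (or any specific GPU)
# specifications; real occupancy also depends on allocation granularity.

def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  sm_regs=65536, sm_smem=228 * 1024, sm_max_threads=2048):
    """Resident blocks per SM = min over each resource's capacity."""
    by_regs = sm_regs // (regs_per_thread * threads_per_block)
    by_smem = sm_smem // smem_per_block if smem_per_block else float("inf")
    by_threads = sm_max_threads // threads_per_block
    return min(by_regs, by_smem, by_threads)

# A shared-memory-hungry block: smem, not registers or threads, is the cap.
print(blocks_per_sm(regs_per_thread=64, threads_per_block=256,
                    smem_per_block=100 * 1024))  # → 2 (limited by smem)
```

The point of such arithmetic is diagnostic: it tells you which resource to trim (register pressure, shared-memory staging, or block size) to fit another block per SM, which is exactly the kind of trade-off a kernel refactor like this one is navigating.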

ThunderKittens 2.0 includes updates to all existing example kernels using newer APIs and active implementation of additional state-of-the-art kernels such as Flash Attention 4, grouped GEMMs, and GEMV operations. The simplified build structure is designed to facilitate adoption by both human developers and AI agents seeking to customize kernels for their specific use cases.
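A grouped GEMM, one of the kernels mentioned above, is mathematically just a batch of independent matrix multiplies whose shapes may differ from group to group (as in mixture-of-experts workloads); GEMV is the special case where one operand is a vector. A minimal plain-Python sketch of the computation itself, not of a GPU kernel:

```python
# What a "grouped GEMM" computes, sketched in plain Python. A fused GPU
# kernel schedules all groups in one launch to amortize overhead, but the
# math per group is just C_g = A_g @ B_g.

def matmul(a, b):
    """Naive dense matmul on nested lists: (m x k) @ (k x n) -> (m x n)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def grouped_gemm(groups):
    """Run every (A_g, B_g) pair; groups may have different shapes."""
    return [matmul(a, b) for a, b in groups]

# Two groups with different shapes; the second is effectively a GEMV.
groups = [
    ([[1, 2], [3, 4]], [[1, 0], [0, 1]]),  # 2x2 @ 2x2 identity
    ([[1, 2, 3]], [[1], [1], [1]]),        # 1x3 @ 3x1 row-sum
]
print(grouped_gemm(groups))  # → [[[1, 2], [3, 4]], [[6]]]
```

The scheduling, tiling, and data-type choices are where the actual kernel engineering lives; the sketch only pins down the contract such a kernel must satisfy.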

Contributions from industry partners and the simplified build system aim to democratize high-performance kernel development for both developers and AI agents.

Editorial Opinion

ThunderKittens 2.0 represents a mature approach to GPU kernel optimization, prioritizing ruthless efficiency over feature accumulation. The technical discoveries around memory synchronization and tensor core pipelining fill important gaps in GPU optimization documentation, benefiting the broader AI infrastructure community. By incorporating industry contributions and simplifying the development experience, this release could accelerate adoption of custom-optimized kernels across AI companies and research institutions.

MLOps & Infrastructure · AI Hardware · Open Source


Suggested

Google / Alphabet · RESEARCH · 2026-04-05
Deep Dive: Optimizing Sharded Matrix Multiplication on TPU with Pallas

GitHub · PRODUCT LAUNCH · 2026-04-05
GitHub Launches Squad: Open Source Multi-Agent AI Framework to Simplify Complex Workflows

NVIDIA · RESEARCH · 2026-04-05
Nvidia Pivots to Optical Interconnects as Copper Hits Physical Limits, Plans 1,000+ GPU Systems by 2028
© 2026 BotBeat