BotBeat

PRODUCT LAUNCH | ThunderKittens (Open Source Project) | 2026-03-17

ThunderKittens 2.0 Released: Major Refactor Delivers Faster GPU Kernels and State-of-the-Art Performance

Key Takeaways

  • ThunderKittens 2.0 focuses on optimization through simplification, cutting unnecessary memory instructions and assembler inefficiencies rather than only adding new features
  • New support for the MXFP8 and NVFP4 data types, together with improved scheduling, enables state-of-the-art performance that matches or exceeds cuBLAS on NVIDIA B200 GPUs
  • A technical deep dive covers GPU optimization insights that hardware vendors do not document well, including memory consistency patterns, tensor core pipelining, and occupancy limitations
Source: Hacker News, https://hazyresearch.stanford.edu/blog/2026-02-19-tk-2

Summary

ThunderKittens 2.0, a CUDA-embedded domain-specific language (DSL) for writing GPU kernels, has been released with a significant internal refactor and new features. Unlike previous releases, which focused on adding capabilities, version 2.0 emphasizes optimization through subtraction: removing inefficiencies, unnecessary memory instructions, and assembler overhead. The release adds support for the MXFP8 and NVFP4 data types, CLC scheduling, finer control over tensor memory, and a simplified build structure that makes kernels easier to adapt.
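For readers unfamiliar with block-scaled formats such as MXFP8, the idea can be sketched in a few lines of Python. This is a simplified illustration, not ThunderKittens code or its API: it assumes the common MX layout of 32 elements sharing one power-of-two scale with FP8 E4M3 elements, and it only computes the shared scale and clamps to the E4M3 range, without rounding values to an FP8 mantissa. The names `scale_blocks` and `block_scale_exponent` are hypothetical.

```python
import math

BLOCK = 32          # assumption: 32 elements share one scale (MX-style layout)
E4M3_MAX = 448.0    # largest finite FP8 E4M3 magnitude
E4M3_EMAX = 8       # exponent of that largest value (448 = 1.75 * 2**8)


def block_scale_exponent(block):
    """Return the power-of-two exponent shared by one block of values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    return (e - 1) - E4M3_EMAX


def scale_blocks(values):
    """Split values into blocks and return (exponent, scaled values) per block.

    Scaled values are clamped to the E4M3 range; mantissa rounding is omitted.
    """
    out = []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        e = block_scale_exponent(block)
        scale = 2.0 ** e
        scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
        out.append((e, scaled))
    return out


if __name__ == "__main__":
    data = [0.5 * i for i in range(64)]  # made-up input, two blocks
    for exponent, scaled in scale_blocks(data):
        print(exponent, max(scaled))
```

Storing one small scale per block lets the individual elements use very few bits while still covering a wide dynamic range, which is the property that low-precision GEMM kernels rely on.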

The release delivers state-of-the-art BF16, MXFP8, and NVFP4 GEMM kernels that match or exceed cuBLAS performance on NVIDIA B200 GPUs. Along the way, the team identified subtle inefficiencies in kernels for modern NVIDIA GPUs, with findings on memory consistency patterns, tensor core pipelining behavior, PTX assembler hinting, and occupancy limitations. The refactor also incorporated contributions from industry partners who had developed internal forks of ThunderKittens.
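Comparisons like "matches or exceeds cuBLAS" are usually stated as achieved throughput. As a minimal sketch, assuming the standard count of 2·M·N·K floating-point operations for an M×K by K×N matrix multiply, throughput in TFLOP/s can be derived from a measured runtime. The function name and the timing in the usage line are hypothetical and are not results from the release.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an (m x k) @ (k x n) GEMM that ran in `seconds`.

    Counts one multiply and one add per inner-product term: 2 * m * n * k.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Made-up timing purely for illustration, not a measured benchmark.
    print(gemm_tflops(8192, 8192, 8192, 1.0e-3))
```

Running two kernels on the same problem size and comparing the resulting figures gives a like-for-like comparison, provided both timings exclude setup and data transfer.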

ThunderKittens 2.0 updates all existing example kernels to the newer APIs, and additional state-of-the-art kernels, such as Flash Attention 4, grouped GEMMs, and GEMV operations, are being actively implemented. The simplified build structure is meant to ease adoption by both human developers and AI agents who want to customize kernels for their own use cases.

  • Contributions from industry partners and simplified build systems aim to democratize high-performance kernel development for developers and AI agents

Editorial Opinion

ThunderKittens 2.0 represents a mature approach to GPU kernel optimization, prioritizing ruthless efficiency over feature accumulation. The technical discoveries around memory synchronization and tensor core pipelining fill important gaps in GPU optimization documentation, benefiting the broader AI infrastructure community. By incorporating industry contributions and simplifying the development experience, this release could accelerate adoption of custom-optimized kernels across AI companies and research institutions.

MLOps & Infrastructure | AI Hardware | Open Source
