Sakana AI and NVIDIA Achieve 20% Speedup in LLM Inference with Sparse Transformer Kernels
Key Takeaways
- Feedforward layers in LLMs exhibit up to 95% sparsity that can be exploited, with minimal performance degradation, via mild L1 regularization
- The TwELL sparse packing format integrates with modern GPU tiled matrix multiplication kernels without extra memory overhead or pipeline disruption
- Custom CUDA kernels deliver 20%+ speedups for both inference and training on H100 GPUs while cutting energy and memory usage
Summary
In collaboration with NVIDIA, Sakana AI introduces new sparse data structures and GPU kernels designed to leverage unstructured sparsity in transformer feedforward layers for more efficient LLM inference and training. The research demonstrates that feedforward layers in modern large language models exhibit up to 95% sparsity, meaning most hidden activations are near zero, and that this wasted computation can be eliminated with minimal performance impact.
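The core observation can be illustrated with a toy example. The sketch below (not code from the paper; the distribution parameters and `lambda_l1` value are illustrative assumptions) shows how a ReLU-style feedforward layer produces mostly-zero activations, and how an L1 penalty on those activations would be computed to push even more of them to exactly zero during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden activations after a ReLU-style nonlinearity. With a
# negative pre-activation mean, most outputs clamp to zero -- the
# kind of unstructured sparsity the research reports in LLM
# feedforward layers (up to ~95%).
hidden = np.maximum(rng.normal(-1.5, 1.0, size=(4, 1024)), 0.0)

# Fraction of (near-)zero activations: the exploitable sparsity.
sparsity = np.mean(np.abs(hidden) < 1e-6)

# An L1 penalty on the activations, added to the training loss,
# encourages exact zeros. lambda_l1 is a hypothetical value, not
# one taken from the paper.
lambda_l1 = 1e-4
l1_penalty = lambda_l1 * np.abs(hidden).sum()

print(f"activation sparsity: {sparsity:.2%}")
```

Computation spent multiplying these zero activations against the next weight matrix is pure waste, which is what the sparse kernels eliminate.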
The team developed TwELL (Tile-wise ELLPACK), a new sparse packing format specifically engineered to integrate seamlessly with NVIDIA's tiled matrix multiplication kernels without disrupting GPU execution pipelines or introducing memory overhead. They complemented this with custom CUDA kernels that fuse multiple matrix multiplications and compress the sparse representation, maximizing throughput for both inference and training workloads.
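The details of TwELL itself are not public, but the general idea of tile-wise ELLPACK can be sketched: instead of padding every row to the matrix-wide maximum nonzero count (classic ELLPACK), padding width is chosen per row-tile, so a few dense rows do not inflate the whole structure and each tile stays a fixed-shape block that maps naturally onto tiled GEMM. The function names and tile size below are illustrative assumptions:

```python
import numpy as np

def pack_tilewise_ell(dense, tile_rows=4):
    """Pack a sparse matrix into a tile-wise ELLPACK-like layout.

    Illustrative sketch only: each group of `tile_rows` rows gets its
    own padding width (the max nonzeros per row within that tile),
    yielding fixed-shape per-tile blocks of column indices and values.
    """
    tiles = []
    for start in range(0, dense.shape[0], tile_rows):
        tile = dense[start:start + tile_rows]
        width = int((tile != 0).sum(axis=1).max())  # local pad width
        cols = np.zeros((tile.shape[0], width), dtype=np.int64)
        vals = np.zeros((tile.shape[0], width), dtype=tile.dtype)
        for r, row in enumerate(tile):
            nz = np.nonzero(row)[0]
            cols[r, :nz.size] = nz
            vals[r, :nz.size] = row[nz]
        tiles.append((cols, vals))
    return tiles

def tile_ell_matvec(tiles, x):
    """Multiply the packed matrix by a vector, one tile at a time."""
    # Padded slots have vals == 0, so gathering x[cols] there is harmless.
    return np.concatenate([(vals * x[cols]).sum(axis=1)
                           for cols, vals in tiles])

# Tiny demo: the packed product matches the dense product.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 16))
A[np.abs(A) < 1.0] = 0.0          # make A unstructured-sparse
x = rng.normal(size=16)
tiles = pack_tilewise_ell(A)
print(np.allclose(tile_ell_matvec(tiles, x), A @ x))
```

The fixed per-tile block shape is what lets a layout like this feed a GPU's tiled matrix-multiply pipeline without the irregular indexing of CSR-style formats.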
When applied to billion-parameter LLMs trained with mild L1 regularization, the sparse kernels deliver over 20% speedups on NVIDIA H100 GPUs for both batched inference and training, while simultaneously reducing energy consumption and memory requirements. The research will be presented at ICML 2026, and its validation at billion-parameter scale suggests practical applicability to production LLM deployments.
Editorial Opinion
This work addresses a critical pain point in large-scale LLM deployment: the substantial computational waste in feedforward layers. By combining algorithmic insights (high sparsity with L1 regularization) with GPU-efficient kernel design (TwELL format), Sakana AI and NVIDIA have created a practical path to significantly reduce operational costs without sacrificing model quality. The 20% speedup and reduced energy footprint could have meaningful implications for LLM inference at scale.


