Sakana AI and NVIDIA Achieve 20% Speedup in LLM Inference with Sparse Transformer Kernels
Key Takeaways
- Feedforward layers in LLMs exhibit up to 95% sparsity that can be exploited, with minimal performance degradation, via mild L1 regularization
- The TwELL sparse packing format integrates with modern GPU tiled matrix multiplication kernels without extra memory overhead or pipeline disruption
- Custom CUDA kernels deliver 20%+ speedups for both inference and training on H100 GPUs while cutting energy and memory usage
Summary
In collaboration with NVIDIA, Sakana AI introduces new sparse data structures and GPU kernels designed to leverage unstructured sparsity in transformer feedforward layers for more efficient LLM inference and training. The research demonstrates that feedforward layers in modern large language models exhibit up to 95% sparsity, meaning most hidden activations are near zero, and that this wasted computation can be eliminated with minimal performance impact.
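The core observation can be illustrated with a toy example. The sketch below (not code from the paper; the distribution parameters and `lambda_l1` value are illustrative assumptions) shows how a ReLU-style feedforward layer produces mostly-zero activations, and how an L1 penalty on those activations would be computed to push even more of them to exactly zero during training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden activations after a ReLU-style nonlinearity. With a
# negative pre-activation mean, most outputs clamp to zero -- the
# kind of unstructured sparsity the research reports in LLM
# feedforward layers (up to ~95%).
hidden = np.maximum(rng.normal(-1.5, 1.0, size=(4, 1024)), 0.0)

# Fraction of (near-)zero activations: the exploitable sparsity.
sparsity = np.mean(np.abs(hidden) < 1e-6)

# An L1 penalty on the activations, added to the training loss,
# encourages exact zeros. lambda_l1 is a hypothetical value, not
# one taken from the paper.
lambda_l1 = 1e-4
l1_penalty = lambda_l1 * np.abs(hidden).sum()

print(f"activation sparsity: {sparsity:.2%}")
```

Computation spent multiplying these zero activations against the next weight matrix is pure waste, which is what the sparse kernels eliminate.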
The team developed TwELL (Tile-wise ELLPACK), a new sparse packing format specifically engineered to integrate seamlessly with NVIDIA's tiled matrix multiplication kernels without disrupting GPU execution pipelines or introducing memory overhead. They complemented this with custom CUDA kernels that fuse multiple matrix multiplications and compress the sparse representation, maximizing throughput for both inference and training workloads.
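The details of TwELL itself are not public, but the general idea of tile-wise ELLPACK can be sketched: instead of padding every row to the matrix-wide maximum nonzero count (classic ELLPACK), padding width is chosen per row-tile, so a few dense rows do not inflate the whole structure and each tile stays a fixed-shape block that maps naturally onto tiled GEMM. The function names and tile size below are illustrative assumptions:

```python
import numpy as np

def pack_tilewise_ell(dense, tile_rows=4):
    """Pack a sparse matrix into a tile-wise ELLPACK-like layout.

    Illustrative sketch only: each group of `tile_rows` rows gets its
    own padding width (the max nonzeros per row within that tile),
    yielding fixed-shape per-tile blocks of column indices and values.
    """
    tiles = []
    for start in range(0, dense.shape[0], tile_rows):
        tile = dense[start:start + tile_rows]
        width = int((tile != 0).sum(axis=1).max())  # local pad width
        cols = np.zeros((tile.shape[0], width), dtype=np.int64)
        vals = np.zeros((tile.shape[0], width), dtype=tile.dtype)
        for r, row in enumerate(tile):
            nz = np.nonzero(row)[0]
            cols[r, :nz.size] = nz
            vals[r, :nz.size] = row[nz]
        tiles.append((cols, vals))
    return tiles

def tile_ell_matvec(tiles, x):
    """Multiply the packed matrix by a vector, one tile at a time."""
    # Padded slots have vals == 0, so gathering x[cols] there is harmless.
    return np.concatenate([(vals * x[cols]).sum(axis=1)
                           for cols, vals in tiles])

# Tiny demo: the packed product matches the dense product.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 16))
A[np.abs(A) < 1.0] = 0.0          # make A unstructured-sparse
x = rng.normal(size=16)
tiles = pack_tilewise_ell(A)
print(np.allclose(tile_ell_matvec(tiles, x), A @ x))
```

The fixed per-tile block shape is what lets a layout like this feed a GPU's tiled matrix-multiply pipeline without the irregular indexing of CSR-style formats.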
When applied to billion-parameter LLMs trained with mild L1 regularization, the sparse kernels deliver over 20% speedups on NVIDIA H100 GPUs for both batched inference and training, while simultaneously reducing energy consumption and memory requirements. The research will be presented at ICML 2026, and its validation at billion-parameter scale suggests practical applicability to production LLM deployments.
Editorial Opinion
This work addresses a critical pain point in large-scale LLM deployment: the substantial computational waste in feedforward layers. By combining algorithmic insights (high sparsity with L1 regularization) with GPU-efficient kernel design (TwELL format), Sakana AI and NVIDIA have created a practical path to significantly reduce operational costs without sacrificing model quality. The 20% speedup and reduced energy footprint could have meaningful implications for LLM inference at scale.


