BotBeat

Sakana AI · RESEARCH · 2026-05-08

Sakana AI and NVIDIA Achieve 20% Speedup in LLM Inference with Sparse Transformer Kernels

Key Takeaways

  • Feedforward layers in LLMs exhibit up to 95% sparsity that can be leveraged with minimal performance degradation using L1 regularization
  • The TwELL sparse packing format integrates with modern GPU tiled matrix multiplication kernels without additional overhead or pipeline disruption
  • Custom CUDA kernels achieve 20%+ speedups in both inference and training on H100 GPUs while cutting energy and memory usage
Source: https://pub.sakana.ai/sparser-faster-llms/ (via Hacker News)

Summary

In collaboration with NVIDIA, Sakana AI introduces new sparse data structures and GPU kernels that exploit unstructured sparsity in transformer feedforward layers for more efficient LLM inference and training. The research demonstrates that feedforward layers in modern large language models are up to 95% sparse: most hidden activations are approximately zero, so the computation spent on them can be eliminated with minimal performance impact.
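The summary does not spell out the training recipe, but an L1 penalty on the hidden activations of each feedforward block is the standard way to push activations toward exact zeros. The PyTorch sketch below illustrates the idea; the layer sizes, the ReLU nonlinearity, and the `lambda_l1` coefficient are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """A standard transformer feedforward block. The hidden activations
    after the nonlinearity are the values that turn out mostly zero."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.act = nn.ReLU()
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        h = self.act(self.up(x))      # hidden activations: mostly zeros
        return self.down(h), h

ffn = FFN()
x = torch.randn(32, 512)
out, h = ffn(x)

# A mild L1 penalty on the hidden activations encourages exact zeros.
# lambda_l1 is a hypothetical coefficient, not a value from the paper.
lambda_l1 = 1e-4
task_loss = out.pow(2).mean()         # stand-in for the real LM loss
loss = task_loss + lambda_l1 * h.abs().mean()
loss.backward()

# ReLU produces exact zeros, so activation sparsity is directly measurable.
sparsity = (h == 0).float().mean().item()
print(f"hidden-activation sparsity: {sparsity:.1%}")
```

The penalty trades a small amount of task loss for activations that are exactly zero rather than merely small, which is what a sparse kernel can actually skip.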

The team developed TwELL (Tile-wise ELLPACK), a new sparse packing format specifically engineered to integrate seamlessly with NVIDIA's tiled matrix multiplication kernels without disrupting GPU execution pipelines or introducing memory overhead. They complemented this with custom CUDA kernels that fuse multiple matrix multiplications and compress the sparse representation, maximizing throughput for both inference and training workloads.
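The article does not describe TwELL's exact memory layout. As a rough guess at what "tile-wise ELLPACK" could mean, the NumPy sketch below packs each GEMM-sized tile of a sparse matrix into classic ELLPACK form (per-row nonzero values plus column indices, padded to the densest row), so padding waste is bounded per tile rather than by the densest row of the whole matrix. The function names and the tile size are hypothetical.

```python
import numpy as np

def pack_tile_ellpack(tile):
    """ELLPACK-pack one dense tile: per-row nonzero values plus column
    indices, padded to the densest row within this tile only."""
    rows, _ = tile.shape
    width = int((tile != 0).sum(axis=1).max())   # uniform width per tile
    vals = np.zeros((rows, width), dtype=tile.dtype)
    cols = np.zeros((rows, width), dtype=np.int32)
    for r in range(rows):
        idx = np.flatnonzero(tile[r])
        vals[r, :len(idx)] = tile[r, idx]
        cols[r, :len(idx)] = idx
    return vals, cols

def pack_tilewise(matrix, tile=64):
    """Split a sparse matrix into tile x tile blocks and ELLPACK-pack each,
    mirroring the tile decomposition a tiled GEMM kernel already uses."""
    packed = {}
    for i in range(0, matrix.shape[0], tile):
        for j in range(0, matrix.shape[1], tile):
            block = matrix[i:i+tile, j:j+tile]
            if np.any(block):                    # skip all-zero tiles entirely
                packed[(i, j)] = pack_tile_ellpack(block)
    return packed

# Demo on a ~95%-sparse matrix.
rng = np.random.default_rng(0)
m = rng.standard_normal((256, 256)) * (rng.random((256, 256)) < 0.05)
packed = pack_tilewise(m)
packed_bytes = sum(v.nbytes + c.nbytes for v, c in packed.values())
print(f"dense: {m.nbytes} B, tile-wise ELLPACK: {packed_bytes} B")
```

Bounding the ELLPACK width per tile is what would let such a format slot into an existing tiled matmul kernel: each thread block still consumes one fixed-shape tile, just with fewer loads.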

When applied to billion-parameter LLMs trained with mild L1 regularization, the sparse kernels deliver over 20% speedups on NVIDIA H100 GPUs for both batched inference and training, while simultaneously reducing energy consumption and memory requirements. The research will be presented at ICML 2026.

  • Results are validated at billion-parameter scale, making them directly relevant to production LLM deployments

Editorial Opinion

This work addresses a critical pain point in large-scale LLM deployment: the substantial computational waste in feedforward layers. By combining algorithmic insights (high sparsity with L1 regularization) with GPU-efficient kernel design (TwELL format), Sakana AI and NVIDIA have created a practical path to significantly reduce operational costs without sacrificing model quality. The 20% speedup and reduced energy footprint could have meaningful implications for LLM inference at scale.

Large Language Models (LLMs) · Deep Learning · MLOps & Infrastructure · Partnerships

