BotBeat

Google / Alphabet
UPDATE · 2026-05-06

Google's Gemma 4 Gets Up to 3x Faster With Multi-Token Prediction

Key Takeaways

  • Multi-Token Prediction uses speculative decoding to predict future tokens with lightweight drafter models, achieving up to 3x faster inference on consumer hardware
  • The technique addresses memory bandwidth bottlenecks in local AI by generating speculative tokens during compute idle time while the main model processes context
  • Testing shows 2.8-3.1x speedups on mobile (Pixel phones) and 2.5x on Apple M4 chips with zero quality degradation
Source: Hacker News · https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/

Summary

Google has released Multi-Token Prediction (MTP) drafters for its open-source Gemma 4 models, using speculative decoding to dramatically accelerate local AI inference. The new experimental feature allows smaller "drafter" models to predict multiple future tokens in parallel while the main model verifies them, effectively producing multiple tokens in the time it previously took to generate just one.
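The draft-then-verify loop described above can be sketched in a few lines. This is an illustrative toy, not Google's MTP implementation: trivial deterministic functions stand in for the drafter and main model, and the verification loop is written sequentially even though in practice the main model scores all draft tokens in a single parallel pass.

```python
def drafter(context, k):
    """Cheap stand-in drafter: guess the next k tokens with a toy rule."""
    out = []
    for _ in range(k):
        nxt = (context[-1] + 1) % 100  # toy rule: next token = last + 1
        out.append(nxt)
        context = context + [nxt]
    return out

def main_model(context):
    """Expensive stand-in model: same rule, but diverges every 4th step
    to simulate the drafter occasionally guessing wrong."""
    nxt = (context[-1] + 1) % 100
    if len(context) % 4 == 0:
        nxt = (nxt + 1) % 100
    return nxt

def speculative_step(context, k=4):
    """One decode step: draft k tokens, verify against the main model,
    and accept the matching prefix plus one corrected/bonus token.
    (In a real system the k verifications happen in one batched pass.)"""
    draft = drafter(list(context), k)
    accepted = []
    for tok in draft:
        target = main_model(context + accepted)
        if tok == target:
            accepted.append(tok)      # drafter guessed right: keep it
        else:
            accepted.append(target)   # first mismatch: emit the main
            break                     # model's token and stop
    else:
        accepted.append(main_model(context + accepted))  # all matched: bonus token
    return accepted

# Each call to speculative_step emits multiple tokens per "round" of the
# expensive model, which is where the speedup comes from.
print(speculative_step([1, 2, 3]))  # → [4, 6]
```

Because every emitted token is either the main model's own greedy choice or a draft token the main model confirmed, the output sequence is identical to plain decoding; this is why the article can claim "zero quality degradation".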

The MTP technology targets a fundamental bottleneck in local AI: memory bandwidth constraints on consumer hardware. Since most personal devices and mobile phones have slower memory than enterprise AI accelerators, processors waste computing cycles waiting to load model parameters. By leveraging idle compute time to speculatively generate tokens with a lightweight drafter (as small as 74 million parameters), Google's approach maintains full quality while dramatically improving speed.
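A back-of-envelope calculation shows why decoding is bandwidth-bound and why a tiny drafter is nearly free. The hardware numbers below are hypothetical (only the 74M drafter size comes from the article), and the 0.5 bytes/parameter figure assumes 4-bit quantization: each generated token must stream every weight from memory, so the token rate is roughly bandwidth divided by model size.

```python
def max_tokens_per_sec(model_params_billions, bytes_per_param, bandwidth_gb_s):
    """Rough bandwidth ceiling on autoregressive decoding: one full pass
    over the weights per token, so rate ~ bandwidth / model size."""
    model_bytes_gb = model_params_billions * bytes_per_param
    return bandwidth_gb_s / model_bytes_gb

# Hypothetical phone-class hardware: ~60 GB/s memory bandwidth.
# A 4B-parameter model at 4-bit (0.5 bytes/param):
print(max_tokens_per_sec(4, 0.5, 60))      # → 30.0 tokens/s ceiling
# A 74M-parameter drafter on the same hardware is ~50x cheaper per token:
print(max_tokens_per_sec(0.074, 0.5, 60))  # ~1600 tokens/s ceiling
```

Under these assumed numbers the drafter's cost per token is a rounding error next to the main model's, so drafting in otherwise idle compute time adds almost no latency.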

In testing, Google reports speed improvements of 2.8x to 3.1x on Gemma 4's smaller E2B and E4B models running on Pixel phones, and a 2.5x boost for the 31B model on Apple's M4 silicon. The company emphasizes that MTP produces "zero quality degradation" since the primary model verifies all draft tokens before output. Combined with Gemma 4's newly permissive Apache 2.0 license, this update makes powerful open-source AI significantly more practical for edge and local deployment.

  • Gemma 4's Apache 2.0 license and improved performance make local AI more accessible for privacy-conscious users and resource-constrained devices
  • MTP drafters are available now for Gemma 4, with the largest models (26B MoE and 31B Dense) now more feasible for consumer hardware
Large Language Models (LLMs) · Generative AI · Machine Learning · AI Hardware

