BotBeat

Google / Alphabet
UPDATE · 2026-05-06

Google's Gemma 4 Gets Up to 3x Faster With Multi-Token Prediction

Key Takeaways

  • Multi-Token Prediction uses speculative decoding to predict future tokens with lightweight drafter models, achieving up to 3x faster inference on consumer hardware
  • The technique addresses memory bandwidth bottlenecks in local AI by generating speculative tokens during compute idle time while the main model processes context
  • Testing shows 2.8-3.1x speedups on mobile (Pixel phones) and 2.5x on Apple M4 chips with zero quality degradation
Source: Hacker News · https://arstechnica.com/ai/2026/05/googles-gemma-4-open-ai-models-use-speculative-decoding-to-get-up-to-3x-faster/

Summary

Google has released Multi-Token Prediction (MTP) drafters for its open-source Gemma 4 models, using speculative decoding to dramatically accelerate local AI inference. The new experimental feature allows smaller "drafter" models to predict multiple future tokens in parallel while the main model verifies them, effectively producing multiple tokens in the time it previously took to generate just one.
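The draft-then-verify loop described above can be sketched in a few lines. This is an illustrative toy, not Google's MTP implementation: trivial deterministic functions stand in for the drafter and main model, and the verification loop is written sequentially even though in practice the main model scores all draft tokens in a single parallel pass.

```python
def drafter(context, k):
    """Cheap stand-in drafter: guess the next k tokens with a toy rule."""
    out = []
    for _ in range(k):
        nxt = (context[-1] + 1) % 100  # toy rule: next token = last + 1
        out.append(nxt)
        context = context + [nxt]
    return out

def main_model(context):
    """Expensive stand-in model: same rule, but diverges every 4th step
    to simulate the drafter occasionally guessing wrong."""
    nxt = (context[-1] + 1) % 100
    if len(context) % 4 == 0:
        nxt = (nxt + 1) % 100
    return nxt

def speculative_step(context, k=4):
    """One decode step: draft k tokens, verify against the main model,
    and accept the matching prefix plus one corrected/bonus token.
    (In a real system the k verifications happen in one batched pass.)"""
    draft = drafter(list(context), k)
    accepted = []
    for tok in draft:
        target = main_model(context + accepted)
        if tok == target:
            accepted.append(tok)      # drafter guessed right: keep it
        else:
            accepted.append(target)   # first mismatch: emit the main
            break                     # model's token and stop
    else:
        accepted.append(main_model(context + accepted))  # all matched: bonus token
    return accepted

# Each call to speculative_step emits multiple tokens per "round" of the
# expensive model, which is where the speedup comes from.
print(speculative_step([1, 2, 3]))  # → [4, 6]
```

Because every emitted token is either the main model's own greedy choice or a draft token the main model confirmed, the output sequence is identical to plain decoding; this is why the article can claim "zero quality degradation".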

The MTP technology targets a fundamental bottleneck in local AI: memory bandwidth constraints on consumer hardware. Since most personal devices and mobile phones have slower memory than enterprise AI accelerators, processors waste computing cycles waiting to load model parameters. By leveraging idle compute time to speculatively generate tokens with a lightweight drafter (as small as 74 million parameters), Google's approach maintains full quality while dramatically improving speed.
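A back-of-envelope calculation shows why decoding is bandwidth-bound and why a tiny drafter is nearly free. The hardware numbers below are hypothetical (only the 74M drafter size comes from the article), and the 0.5 bytes/parameter figure assumes 4-bit quantization: each generated token must stream every weight from memory, so the token rate is roughly bandwidth divided by model size.

```python
def max_tokens_per_sec(model_params_billions, bytes_per_param, bandwidth_gb_s):
    """Rough bandwidth ceiling on autoregressive decoding: one full pass
    over the weights per token, so rate ~ bandwidth / model size."""
    model_bytes_gb = model_params_billions * bytes_per_param
    return bandwidth_gb_s / model_bytes_gb

# Hypothetical phone-class hardware: ~60 GB/s memory bandwidth.
# A 4B-parameter model at 4-bit (0.5 bytes/param):
print(max_tokens_per_sec(4, 0.5, 60))      # → 30.0 tokens/s ceiling
# A 74M-parameter drafter on the same hardware is ~50x cheaper per token:
print(max_tokens_per_sec(0.074, 0.5, 60))  # ~1600 tokens/s ceiling
```

Under these assumed numbers the drafter's cost per token is a rounding error next to the main model's, so drafting in otherwise idle compute time adds almost no latency.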

In testing, Google reports speed improvements of 2.8x to 3.1x on Gemma 4's smaller E2B and E4B models running on Pixel phones, and a 2.5x boost for the 31B model on Apple's M4 silicon. The company emphasizes that MTP produces "zero quality degradation" since the primary model verifies all draft tokens before output. Combined with Gemma 4's newly permissive Apache 2.0 license, this update makes powerful open-source AI significantly more practical for edge and local deployment.

  • Gemma 4's Apache 2.0 license and improved performance make local AI more accessible for privacy-conscious users and resource-constrained devices
  • MTP drafters are available now for Gemma 4, with the largest models (26B MoE and 31B Dense) now more feasible for consumer hardware
Large Language Models (LLMs) · Generative AI · Machine Learning · AI Hardware

