BotBeat

Google / Alphabet | RESEARCH | 2026-07-02

Google Retrofits Multi-Token Prediction Into Frozen Gemini Nano Models for Faster Mobile AI

Key Takeaways

  • Multi-Token Prediction retrofitted onto frozen Gemini Nano v3 models without requiring model retraining
  • Eliminates the need for separate drafting models, reducing memory overhead and enabling faster deployment across tasks
  • Already rolled out to Pixel 9 and 10 devices, powering faster AI features like notification summaries and text proofreading
Source: Hacker News, https://research.google/blog/accelerating-gemini-nano-models-on-pixel-with-frozen-multi-token-prediction/

Summary

Google announced a method for retrofitting Multi-Token Prediction (MTP) onto existing, frozen Gemini Nano v3 models, significantly accelerating on-device inference on Pixel 9 and 10 phones without a separate drafting model. The technique targets a core constraint of mobile AI: autoregressive generation produces one token per forward pass, which leaves processing power underutilized while consuming memory bandwidth and battery. Rather than training a standalone drafter, Google appends a lightweight Transformer head to the main model's final layers, reusing the computation the model has already done to predict several future tokens in parallel.
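The draft-and-verify loop behind this kind of speedup can be sketched in a few lines of Python. This is a toy illustration, not Google's implementation: `target_next` stands in for the frozen main model, `draft_tokens` for the lightweight MTP head, and the function names, the integer vocabulary, and the drafter's deliberate error rule are all assumptions made up for the example. It uses greedy decoding, so the output is identical to decoding one token at a time; only the number of main-model passes changes.

```python
VOCAB = 50  # hypothetical toy vocabulary size


def target_next(context):
    """Stand-in for the frozen main model: deterministic greedy next token."""
    return (7 * context[-1] + 3 * len(context) + 1) % VOCAB


def draft_tokens(context, k):
    """Stand-in for the drafting head: guesses k future tokens.

    It imitates the target but is deliberately wrong whenever the true
    token is divisible by 4, so some drafts get rejected.
    """
    ctx = list(context)
    guesses = []
    for _ in range(k):
        guess = target_next(ctx)
        if guess % 4 == 0:
            guess = (guess + 1) % VOCAB  # injected drafting error
        guesses.append(guess)
        ctx.append(guess)
    return guesses


def greedy_generate(prompt, num_new):
    """Baseline: one main-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_generate(prompt, num_new, k):
    """Draft k tokens, verify them against the target, keep the agreed prefix."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < num_new:
        draft = draft_tokens(tokens, k)
        passes += 1  # a real system checks all k + 1 positions in one batched pass
        accepted = []
        for i in range(k + 1):
            target_tok = target_next(tokens + accepted)
            accepted.append(target_tok)  # the target's token is always kept
            if i == k or draft[i] != target_tok:
                break  # drafts exhausted, or first mismatch was just corrected
        tokens.extend(accepted)
    return tokens[: len(prompt) + num_new], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, passes = speculative_generate(prompt, num_new=20, k=3)
    print("matches baseline:", out == greedy_generate(prompt, 20))
    print("main-model passes:", passes, "for 20 new tokens")
```

Each verification pass yields at least one token (the target's own prediction) and at most `k + 1`, which is why a drafter that is usually right reduces the number of expensive main-model passes without changing the generated text.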

The approach builds on speculative decoding techniques like EAGLE and CALM, but is optimized for the extreme resource constraints of mobile devices. A standalone drafter adds memory overhead and is semantically blind, meaning it has no access to the main model's internal representations. Attaching the head to the frozen model removes both problems and spares developers from fine-tuning a separate drafter for each task. Google reports that features like AI Notification Summaries and Proofread now generate text significantly faster while consuming less energy.
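To get a rough feel for how drafting translates into speed, a standard back-of-the-envelope estimate from the speculative decoding literature can help. It assumes each drafted token is accepted independently with probability \alpha and that k tokens are drafted per verification pass; both symbols and the numbers below are illustrative assumptions, not figures reported by Google.

```latex
% Expected tokens produced per main-model verification pass,
% assuming i.i.d. acceptance probability \alpha and draft length k:
\mathbb{E}[\text{tokens per pass}]
  = \sum_{i=0}^{k} \alpha^{i}
  = \frac{1 - \alpha^{k+1}}{1 - \alpha}

% Illustrative values only: \alpha = 0.8,\ k = 3
\frac{1 - 0.8^{4}}{1 - 0.8} = \frac{0.5904}{0.2} \approx 2.95
```

Under these assumed values the main model would emit about three tokens per pass instead of one; the real gain depends on the actual acceptance rate and on the cost of running the drafting head.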

  • Addresses core mobile constraint: leverages existing model computations to parallelize token generation while respecting strict energy and RAM budgets
  • Practical solution for on-device LLM inference that reduces friction for developers adopting edge AI

Editorial Opinion

This work is a pragmatic engineering solution to a real problem: bringing LLM capabilities to billions of phones while respecting hardware constraints. By retrofitting MTP onto frozen models rather than requiring retraining, Google has opened a path to rapid deployment of inference speedups across its installed base, a significant advantage for both the company and developers. The architectural insight that a lightweight head can leverage the backbone's latent representations is elegant and generalizable, and is likely to influence how other teams optimize on-device inference at scale.

Tags: Large Language Models (LLMs), Generative AI, Deep Learning, MLOps & Infrastructure

