Google Retrofits Multi-Token Prediction Into Frozen Gemini Nano Models for Faster Mobile AI
Key Takeaways
- Multi-Token Prediction is retrofitted onto frozen Gemini Nano v3 models without retraining the base model
- Eliminates the need for separate drafting models, reducing memory overhead and enabling faster deployment across tasks
- Already rolled out to Pixel 9 and 10 devices, powering faster AI features such as notification summaries and text proofreading
Summary
Google announced a method to retrofit Multi-Token Prediction (MTP) onto existing, frozen Gemini Nano v3 models, significantly accelerating on-device inference on Pixel 9 and 10 phones without requiring separate drafting models. The technique addresses a core constraint in mobile AI: autoregressive generation produces one token per forward pass, so the model weights must be read from memory again for every token, which leaves compute units underutilized while consuming memory bandwidth and battery. Rather than training standalone drafter models, Google appends a lightweight Transformer head to the main model's final layers, reusing the model's existing computations to predict multiple future tokens in parallel.
The approach builds on speculative decoding techniques like EAGLE and CALM, but is optimized for the extreme resource constraints of mobile devices. A standalone drafter adds its own memory footprint and is "semantically blind", meaning it cannot see the main model's internal representations. By avoiding both problems, the frozen-model approach lets developers deploy high-speed on-device AI without fine-tuning a separate drafting model for each task. Google reports that features like AI Notification Summaries and Proofread now generate text significantly faster while consuming less energy.
- Addresses core mobile constraint: leverages existing model computations to parallelize token generation while respecting strict energy and RAM budgets
- Practical solution for on-device LLM inference that reduces friction for developers adopting edge AI
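The draft-and-verify loop behind speculative decoding can be illustrated with a toy sketch. It is not Google's implementation: `target_next`, `draft_tokens`, and `speculative_decode` are hypothetical names, the "models" are simple arithmetic stand-ins, and greedy decoding is assumed. It shows only the general idea that a cheap drafter proposes several tokens, the main model checks them in one (here simulated) pass, and the final output matches what the main model would have produced on its own.

```python
from typing import List, Tuple

Token = int
VOCAB = 11  # toy vocabulary size (assumption for illustration)


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the frozen main model's greedy next token."""
    return (context[-1] * 3 + len(context)) % VOCAB


def draft_tokens(context: List[Token], k: int) -> List[Token]:
    """Hypothetical stand-in for a lightweight draft head proposing k tokens.

    It is deliberately wrong whenever the context length is a multiple of 4,
    so that the rejection path is exercised.
    """
    ctx = list(context)
    proposed = []
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # injected drafting error
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def greedy_decode(prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target_next(out))
    return out


def speculative_decode(
    prompt: List[Token], num_new: int, k: int
) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the longest prefix the main model agrees with.

    Returns the generated sequence and the number of verification passes.
    A real system would score all k draft positions in one batched forward
    pass; here that pass is simulated by repeated calls to target_next.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < num_new:
        draft = draft_tokens(out, k)
        passes += 1  # one simulated batched verification pass
        ctx = list(out)
        for tok in draft:
            expected = target_next(ctx)
            if tok != expected:
                ctx.append(expected)  # replace the first mismatch and stop
                break
            ctx.append(tok)  # draft token accepted
        else:
            ctx.append(target_next(ctx))  # all accepted: one bonus token
        out = ctx
    return out[: len(prompt) + num_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], num_new=12, k=4)
    print(tokens, passes)
```

Because every emitted token is either an accepted draft that equals the main model's choice or a correction taken from the main model, the result is identical to plain greedy decoding. The saving comes from needing fewer main-model passes than generated tokens whenever at least some drafts are accepted.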
Editorial Opinion
This work is a pragmatic engineering solution to a real problem: bringing LLM capabilities to billions of phones while respecting hardware constraints. By retrofitting MTP onto frozen models rather than requiring retraining, Google has opened a path for rapid deployment of inference speedups across its installed base, a significant advantage for both the company and developers. The architectural insight that a lightweight head can reuse the backbone's latent representations is elegant and generalizable, and is likely to influence how other teams optimize on-device inference at scale.



