BotBeat

Google / Alphabet
RESEARCH · 2026-05-05

UCSD Researchers Achieve 3X Speedup on Google TPUs with Diffusion-Style Speculative Decoding

Key Takeaways

  • DFlash achieves a 3.13x average speedup in tokens per second on TPU v5p, with peaks near 6x on complex math tasks
  • Drafting moves from K sequential autoregressive forward passes, O(K), to a single parallel block-diffusion pass, O(1)
  • Outperforms EAGLE-3 by about 1.76x (2.29x vs. 1.30x end-to-end serving speedup on TPU v5p)
Source: Hacker News, https://developers.googleblog.com/supercharging-llm-inference-on-google-tpus-achieving-3x-speedups-with-diffusion-style-speculative-decoding/

Summary

Researchers at UCSD, led by Hao Zhang (co-inventor of paged attention), have implemented DFlash, a diffusion-style speculative decoding technique, on Google TPUs. On TPU v5p it delivers a 3.13x average increase in tokens per second, with peak speedups of nearly 6x on complex math tasks. The work targets a bottleneck in current LLM acceleration: traditional autoregressive speculative decoding needs K sequential forward passes of the draft model to produce K candidate tokens, so drafting latency grows with K and limits the practical speedup. DFlash replaces this token-by-token drafting with block diffusion, generating an entire block of candidate tokens in a single forward pass (O(1) passes instead of O(K)), which cuts drafting latency and makes better use of the TPU's parallel compute.
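The difference between the two drafting styles can be illustrated with a toy draft-and-verify loop. The sketch below is not DFlash or any vLLM code: the "target model" is a made-up deterministic rule, both drafters are hypothetical stand-ins, and every name in it (`target_next`, `AutoregressiveDrafter`, `BlockDrafter`, `speculative_decode`) is an assumption introduced for illustration. It only shows that, with identical draft quality, a sequential drafter spends K forward passes per cycle while a block drafter spends one, and that greedy verification keeps the output identical to plain decoding.

```python
from typing import List, Tuple

VOCAB = 10


def target_next(context: List[int]) -> int:
    """Toy target model: counts upward, but wraps to 0 after token 7."""
    last = context[-1]
    return 0 if last == 7 else (last + 1) % VOCAB


class AutoregressiveDrafter:
    """Hypothetical drafter: one forward pass per drafted token (O(K))."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def draft(self, context: List[int], k: int) -> List[int]:
        drafted: List[int] = []
        last = context[-1]
        for _ in range(k):
            self.forward_passes += 1  # each token needs its own pass
            last = (last + 1) % VOCAB  # naive guess: always count upward
            drafted.append(last)
        return drafted


class BlockDrafter:
    """Hypothetical drafter: all K positions filled in one pass (O(1))."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def draft(self, context: List[int], k: int) -> List[int]:
        self.forward_passes += 1  # the whole block comes from a single pass
        last = context[-1]
        return [(last + 1 + i) % VOCAB for i in range(k)]


def greedy_decode(prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(
    prompt: List[int], drafter, k: int, max_new_tokens: int
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: accept the longest correct draft prefix."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        drafted = drafter.draft(tokens, k)
        target_passes += 1  # one verification pass checks the whole block
        accepted: List[int] = []
        for tok in drafted:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # target's token replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft matched, so the target contributes one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


PROMPT, K, N = [0], 4, 24
baseline = greedy_decode(PROMPT, N)
ar, block = AutoregressiveDrafter(), BlockDrafter()
ar_out, ar_cycles = speculative_decode(PROMPT, ar, K, N)
block_out, block_cycles = speculative_decode(PROMPT, block, K, N)

print("outputs match baseline:", ar_out == baseline and block_out == baseline)
print("drafter passes (autoregressive):", ar.forward_passes)
print("drafter passes (block):", block.forward_passes)
```

In this toy setup the two drafters propose the same tokens, so the number of verification cycles is the same and only the drafting cost differs. A real block-diffusion drafter will have its own acceptance behavior and per-pass cost, which this sketch does not model.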

The UCSD team integrated DFlash directly into the open-source vLLM TPU inference ecosystem and measured it against existing methods. In a head-to-head comparison on TPU v5p, DFlash achieved a 2.29x end-to-end serving speedup versus 1.30x for EAGLE-3, a relative advantage of about 1.76x. The drafter leverages hidden features extracted from the target model, and the researchers worked closely with Google Cloud engineers to optimize the implementation for the TPU's Matrix Multiplication Units (MXUs), highlighting the importance of hardware-aware algorithm design.
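The relative figure follows directly from the two reported speedups. The short check below assumes both numbers are measured against the same non-speculative baseline, which the summary implies but does not state outright; the helper `latency_reduction` is a name introduced here for illustration.

```python
# Reported end-to-end serving speedups on TPU v5p (from the summary above).
dflash_speedup = 2.29
eagle3_speedup = 1.30

# Assumption: both speedups share the same baseline, so their ratio is the
# throughput advantage of DFlash over EAGLE-3.
relative_speedup = dflash_speedup / eagle3_speedup
print(f"DFlash vs EAGLE-3: {relative_speedup:.2f}x")  # prints 1.76x


def latency_reduction(speedup: float) -> float:
    """Fraction of baseline time saved at a given speedup."""
    return 1.0 - 1.0 / speedup


print(f"DFlash time saved:  {latency_reduction(dflash_speedup):.1%}")
print(f"EAGLE-3 time saved: {latency_reduction(eagle3_speedup):.1%}")
```

Expressed as time saved, the gap looks different from the raw ratio, which is worth keeping in mind when comparing speedup claims across methods.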

  • Open-source integration with vLLM opens the technique to adoption and further optimization across the TPU ecosystem
  • Shows the value of rethinking basic inference patterns to get past autoregressive bottlenecks

Editorial Opinion

This work marks a significant shift in LLM inference optimization. By replacing autoregressive drafting with block diffusion, the UCSD team has changed how the drafting stage of speculative decoding works, with concrete results: a 2.29x serving speedup against 1.30x for EAGLE-3, nearly double the gain of a leading competing method. The result suggests that optimizing for hardware parallelism through algorithmic changes, rather than incremental improvements to existing patterns, may unlock substantial gains across the AI inference industry.

Large Language Models (LLMs) · Machine Learning · MLOps & Infrastructure · AI Hardware · Open Source
