UC Berkeley · UPDATE · 2026-04-28

vLLM Extends Disaggregated Serving to Hybrid SSM-FA Models

Key Takeaways

  • vLLM now supports disaggregated prefill/decode serving for hybrid SSM-FA models (available in v0.20.0+)
  • Hybrid models combine SSM layers (linear-time, efficient) with full-attention layers (expressive, accurate) to balance performance and model quality
  • The solution uses dual descriptor views and block bridging to handle two different KV cache formats without modifying existing transformer support
Source: Hacker News, https://vllm-website-lx4pji0mz-inferact-inc.vercel.app/blog/hybrid-ssm-disagg

Summary

UC Berkeley's vLLM team has extended disaggregated prefill/decode (P/D) serving support to hybrid state-space-model (SSM) architectures that interleave SSM layers with full-attention layers, such as NVIDIA's Nemotron-H. Previously, vLLM's NIXL-based disaggregated serving—where a prefill worker computes KV cache blocks and a decode worker pulls them via RDMA to eliminate redundant computation—only worked with standard transformer models. The challenge was that SSM layers store fundamentally different state (collapsed conv state and temporal SSM state) compared to attention layers, requiring different block sizes and layouts.
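
To make that layout mismatch concrete, here is a back-of-the-envelope sketch in Python. All dimensions are invented for illustration (they are not Nemotron-H's actual configuration): an attention layer's cache grows with the prompt and splits naturally into many equal-sized token blocks, while an SSM layer keeps one fixed-size state per sequence, so the two cannot share a single block descriptor format.

```python
# Illustrative only: contrasts the per-token KV cache of an attention layer
# with the fixed-size recurrent state of a Mamba-style SSM layer. All
# dimensions are invented for this sketch, not Nemotron-H's real config.

# Attention layer: the KV cache grows linearly with sequence length and is
# stored as fixed-size blocks of `block_size` tokens each.
num_kv_heads, head_dim, block_size = 8, 128, 16
tokens = 4096
kv_bytes_per_token = 2 * num_kv_heads * head_dim * 2   # K and V, fp16
attn_blocks = tokens // block_size
attn_bytes = tokens * kv_bytes_per_token

# SSM layer: the state is "collapsed" into one conv window plus one temporal
# recurrent state per sequence, independent of how many tokens were prefilled.
d_inner, d_conv, d_state = 8192, 4, 128
conv_bytes = d_inner * d_conv * 2                      # rolling conv window, fp16
ssm_bytes = d_inner * d_state * 2                      # recurrent SSM state, fp16

print(f"attention: {attn_blocks} blocks, {attn_bytes / 2**20:.1f} MiB (grows with tokens)")
print(f"SSM state: 1 block, {(conv_bytes + ssm_bytes) / 2**20:.2f} MiB (constant size)")
```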

The solution introduces three key techniques: dual descriptor views, which maintain separate block descriptors for FA and SSM blocks that index the same physical memory with different offsets and sizes; physical/logical block bridging, which handles mismatches between the block abstraction and the actual attention kernels' requirements; and a 3-descriptor conv state scheme for heterogeneous tensor-parallel transfers without data reshuffling. These changes are purely additive and activate only when a model contains SSM layers, leaving the standard transformer workflow entirely unchanged. The feature is available in vLLM v0.20.0 and later.
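
As a rough illustration of the first technique, the sketch below keeps two descriptor lists over one shared physical buffer: many per-token FA blocks plus a single collapsed SSM state. The names and the memory layout are assumptions made for this example, not vLLM's actual internals.

```python
# Hypothetical sketch of the dual-descriptor-view idea described above. The
# class and function names are invented for illustration; the (assumed)
# layout places the collapsed SSM state after the FA token blocks within
# one shared physical region.
from dataclasses import dataclass

@dataclass
class BlockDescriptor:
    base_addr: int   # start address of the shared physical region
    offset: int      # byte offset of this block within the region
    length: int      # number of bytes one transfer would move

def build_views(base_addr, fa_block_bytes, num_fa_blocks, ssm_state_bytes):
    """Return two descriptor views over the same physical memory: many
    small fixed-size FA blocks, plus one collapsed SSM-state block."""
    fa_view = [BlockDescriptor(base_addr, i * fa_block_bytes, fa_block_bytes)
               for i in range(num_fa_blocks)]
    ssm_view = [BlockDescriptor(base_addr, num_fa_blocks * fa_block_bytes,
                                ssm_state_bytes)]
    return fa_view, ssm_view

fa_view, ssm_view = build_views(base_addr=0x7F00_0000_0000,
                                fa_block_bytes=16 * 4096,
                                num_fa_blocks=256,
                                ssm_state_bytes=2 * 1024 * 1024)
print(f"{len(fa_view)} FA descriptors, {len(ssm_view)} SSM descriptor")
```

Because both views point into the same allocation, a decode worker can issue transfers at whichever granularity a given layer type needs, without duplicating the underlying memory.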

This work responds to the growing adoption of hybrid architectures, which combine the linear-time efficiency of SSMs with the expressiveness of full attention. Extending disaggregated serving to these models enables significantly more efficient deployment by avoiding redundant prefill computation during inference.

  • Disaggregated serving enables compute-efficient inference by pulling pre-computed KV cache blocks via RDMA instead of recomputing them, critical as hybrid architectures gain adoption
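
For intuition, here is a toy sketch of that prefill/decode handoff. Nothing here is real vLLM code: every function is a stand-in, and a plain copy takes the place of the RDMA read.

```python
# Toy end-to-end picture of disaggregated P/D serving, purely illustrative:
# the prefill "worker" produces cache blocks once, and the decode "worker"
# reuses a pulled copy of them (a stand-in for an RDMA read) instead of
# recomputing the prompt. No real model or transport is involved.
def prefill(prompt_tokens, block_size=4):
    """Stand-in for the prefill forward pass: one cache block per
    block_size prompt tokens."""
    return [prompt_tokens[i:i + block_size]
            for i in range(0, len(prompt_tokens), block_size)]

def pull_blocks(blocks):
    """Stand-in for the transfer: the decode side receives the blocks
    without re-running prefill."""
    return [list(b) for b in blocks]

def decode(blocks, max_new_tokens=3):
    """Stand-in for decode steps that read the transferred cache."""
    token = blocks[-1][-1]
    out = []
    for _ in range(max_new_tokens):
        token = (token * 31 + 7) % 1000   # fake next-token rule
        out.append(token)
    return out

cache = prefill(list(range(10)))   # runs on the prefill worker
pulled = pull_blocks(cache)        # decode worker pulls rather than recomputes
print(f"transferred {len(pulled)} blocks, decoded {decode(pulled)}")
```
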
Tags: Large Language Models (LLMs), Machine Learning, MLOps & Infrastructure, Open Source
