vLLM Extends Disaggregated Serving to Hybrid SSM-FA Models
Key Takeaways
- vLLM now supports disaggregated prefill/decode serving for hybrid SSM-FA models (available in v0.20.0+)
- Hybrid models combine SSM layers (linear-time, efficient) with full-attention layers (expressive, accurate) to balance performance and model quality
- The solution uses dual descriptor views and block bridging to handle two different KV cache formats without modifying existing transformer support
Summary
UC Berkeley's vLLM team has extended disaggregated prefill/decode (P/D) serving support to hybrid state-space-model (SSM) architectures that interleave SSM layers with full-attention layers, such as NVIDIA's Nemotron-H. Previously, vLLM's NIXL-based disaggregated serving—where a prefill worker computes KV cache blocks and a decode worker pulls them via RDMA to eliminate redundant computation—only worked with standard transformer models. The challenge was that SSM layers store fundamentally different state (collapsed conv state and temporal SSM state) compared to attention layers, requiring different block sizes and layouts.
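To make the state mismatch concrete, the sketch below contrasts what each layer type must hand off between workers. All names and shapes here are illustrative assumptions, not vLLM's actual classes or configuration: an attention layer's cache grows with the number of prefilled tokens, while an SSM layer's collapsed conv and SSM states are fixed-size per sequence.

```python
import numpy as np

# Illustrative shapes only -- not vLLM's real configuration.
NUM_TOKENS = 16                  # tokens held in one KV cache block
NUM_HEADS, HEAD_DIM = 8, 64      # attention geometry
CONV_KERNEL, SSM_STATE_DIM, INNER_DIM = 4, 16, 512  # SSM geometry

# A full-attention (FA) layer caches K and V for every token in the block,
# so its transfer size scales with how much has been prefilled.
fa_block = np.zeros((2, NUM_TOKENS, NUM_HEADS, HEAD_DIM), dtype=np.float16)

# An SSM layer instead keeps a *collapsed*, fixed-size state per sequence:
# a short convolution window plus the recurrent temporal SSM state.
# Its size does not depend on the number of prefilled tokens.
conv_state = np.zeros((INNER_DIM, CONV_KERNEL - 1), dtype=np.float16)
ssm_state = np.zeros((INNER_DIM, SSM_STATE_DIM), dtype=np.float16)

print("FA block bytes:  ", fa_block.nbytes)                      # grows with tokens
print("SSM state bytes: ", conv_state.nbytes + ssm_state.nbytes)  # fixed
```

Because the two formats differ in both size and layout, a single block-descriptor scheme cannot describe them uniformly, which motivates the techniques below.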
The solution introduces three key techniques: dual descriptor views that maintain separate block descriptors for FA and SSM blocks indexing the same physical memory with different offsets and sizes; physical/logical block bridging to handle mismatches between the block abstraction and actual attention kernel requirements; and 3-descriptor conv state transfer for heterogeneous tensor-parallel transfers without data reshuffling. These changes are purely additive extensions that activate only when models contain SSM layers, leaving the standard transformer workflow entirely unchanged. The feature is available in vLLM v0.20.0 and later.
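The dual-descriptor idea can be sketched as two sets of lightweight (offset, size) descriptors laid over one shared physical pool, so the transfer layer can address each cache format correctly without copying the pool. The class and function names below are hypothetical, chosen for illustration; they are not vLLM's or NIXL's actual API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BlockDescriptor:
    """Hypothetical (offset, size) view into a shared physical pool."""
    offset: int   # byte offset into the pool
    nbytes: int   # size of this block's state in bytes


# One physical byte pool backs both cache formats (sizes are illustrative).
pool = np.zeros(1 << 20, dtype=np.uint8)

FA_BLOCK_BYTES = 32 * 1024    # per-block K/V for an attention layer
SSM_BLOCK_BYTES = 20 * 1024   # collapsed conv + SSM state

# Two descriptor *views* index the same physical memory with different
# offsets and sizes; neither format requires its own copy of the pool.
fa_descs = [BlockDescriptor(i * FA_BLOCK_BYTES, FA_BLOCK_BYTES)
            for i in range(4)]
ssm_base = 4 * FA_BLOCK_BYTES
ssm_descs = [BlockDescriptor(ssm_base + i * SSM_BLOCK_BYTES, SSM_BLOCK_BYTES)
             for i in range(2)]


def view(pool: np.ndarray, d: BlockDescriptor) -> np.ndarray:
    """Zero-copy view of one block; a real system would RDMA this range."""
    return pool[d.offset : d.offset + d.nbytes]
```

In this toy layout a decode worker resolving `fa_descs[0]` and `ssm_descs[0]` gets two differently sized windows into the same buffer, which is the essence of the dual-view approach described above.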
This advancement addresses the growing adoption of hybrid architectures, which combine the linear-time efficiency of SSMs with the expressiveness of full attention. Extending disaggregated serving to these models enables markedly more efficient deployment by eliminating redundant prefill computation during inference.


