UC Berkeley · UPDATE · 2026-04-28

vLLM Extends Disaggregated Serving to Hybrid SSM-FA Models

Key Takeaways

  • vLLM now supports disaggregated prefill/decode serving for hybrid SSM-FA models (available in v0.20.0+)
  • Hybrid models combine SSM layers (linear-time, efficient) with full-attention layers (expressive, accurate) to balance performance and model quality
  • The solution uses dual descriptor views and block bridging to handle two different KV cache formats without modifying existing transformer support
Source: Hacker News, https://vllm-website-lx4pji0mz-inferact-inc.vercel.app/blog/hybrid-ssm-disagg

Summary

UC Berkeley's vLLM team has extended disaggregated prefill/decode (P/D) serving support to hybrid state-space-model (SSM) architectures that interleave SSM layers with full-attention layers, such as NVIDIA's Nemotron-H. Previously, vLLM's NIXL-based disaggregated serving—where a prefill worker computes KV cache blocks and a decode worker pulls them via RDMA to eliminate redundant computation—only worked with standard transformer models. The challenge was that SSM layers store fundamentally different state (collapsed conv state and temporal SSM state) compared to attention layers, requiring different block sizes and layouts.
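
To make that layout mismatch concrete, here is a back-of-the-envelope sketch in Python. All dimensions are invented for illustration (they are not Nemotron-H's actual configuration): an attention layer's cache grows with the prompt and splits naturally into many equal-sized token blocks, while an SSM layer keeps one fixed-size state per sequence, so the two cannot share a single block descriptor format.

```python
# Illustrative only: contrasts the per-token KV cache of an attention layer
# with the fixed-size recurrent state of a Mamba-style SSM layer. All
# dimensions are invented for this sketch, not Nemotron-H's real config.

# Attention layer: the KV cache grows linearly with sequence length and is
# stored as fixed-size blocks of `block_size` tokens each.
num_kv_heads, head_dim, block_size = 8, 128, 16
tokens = 4096
kv_bytes_per_token = 2 * num_kv_heads * head_dim * 2   # K and V, fp16
attn_blocks = tokens // block_size
attn_bytes = tokens * kv_bytes_per_token

# SSM layer: the state is "collapsed" into one conv window plus one temporal
# recurrent state per sequence, independent of how many tokens were prefilled.
d_inner, d_conv, d_state = 8192, 4, 128
conv_bytes = d_inner * d_conv * 2                      # rolling conv window, fp16
ssm_bytes = d_inner * d_state * 2                      # recurrent SSM state, fp16

print(f"attention: {attn_blocks} blocks, {attn_bytes / 2**20:.1f} MiB (grows with tokens)")
print(f"SSM state: 1 block, {(conv_bytes + ssm_bytes) / 2**20:.2f} MiB (constant size)")
```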

The solution introduces three key techniques: dual descriptor views, which maintain separate block descriptors for FA and SSM blocks that index the same physical memory with different offsets and sizes; physical/logical block bridging, which handles mismatches between the block abstraction and the actual attention kernels' requirements; and a 3-descriptor conv state scheme for heterogeneous tensor-parallel transfers without data reshuffling. These changes are purely additive and activate only when a model contains SSM layers, leaving the standard transformer workflow entirely unchanged. The feature is available in vLLM v0.20.0 and later.
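
As a rough illustration of the first technique, the sketch below keeps two descriptor lists over one shared physical buffer: many per-token FA blocks plus a single collapsed SSM state. The names and the memory layout are assumptions made for this example, not vLLM's actual internals.

```python
# Hypothetical sketch of the dual-descriptor-view idea described above. The
# class and function names are invented for illustration; the (assumed)
# layout places the collapsed SSM state after the FA token blocks within
# one shared physical region.
from dataclasses import dataclass

@dataclass
class BlockDescriptor:
    base_addr: int   # start address of the shared physical region
    offset: int      # byte offset of this block within the region
    length: int      # number of bytes one transfer would move

def build_views(base_addr, fa_block_bytes, num_fa_blocks, ssm_state_bytes):
    """Return two descriptor views over the same physical memory: many
    small fixed-size FA blocks, plus one collapsed SSM-state block."""
    fa_view = [BlockDescriptor(base_addr, i * fa_block_bytes, fa_block_bytes)
               for i in range(num_fa_blocks)]
    ssm_view = [BlockDescriptor(base_addr, num_fa_blocks * fa_block_bytes,
                                ssm_state_bytes)]
    return fa_view, ssm_view

fa_view, ssm_view = build_views(base_addr=0x7F00_0000_0000,
                                fa_block_bytes=16 * 4096,
                                num_fa_blocks=256,
                                ssm_state_bytes=2 * 1024 * 1024)
print(f"{len(fa_view)} FA descriptors, {len(ssm_view)} SSM descriptor")
```

Because both views point into the same allocation, a decode worker can issue transfers at whichever granularity a given layer type needs, without duplicating the underlying memory.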

This work responds to the growing adoption of hybrid architectures, which combine the linear-time efficiency of SSMs with the expressiveness of full attention. Extending disaggregated serving to these models enables significantly more efficient deployment by avoiding redundant prefill computation during inference.

  • Disaggregated serving enables compute-efficient inference by pulling pre-computed KV cache blocks via RDMA instead of recomputing them, critical as hybrid architectures gain adoption
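
For intuition, here is a toy sketch of that prefill/decode handoff. Nothing here is real vLLM code: every function is a stand-in, and a plain copy takes the place of the RDMA read.

```python
# Toy end-to-end picture of disaggregated P/D serving, purely illustrative:
# the prefill "worker" produces cache blocks once, and the decode "worker"
# reuses a pulled copy of them (a stand-in for an RDMA read) instead of
# recomputing the prompt. No real model or transport is involved.
def prefill(prompt_tokens, block_size=4):
    """Stand-in for the prefill forward pass: one cache block per
    block_size prompt tokens."""
    return [prompt_tokens[i:i + block_size]
            for i in range(0, len(prompt_tokens), block_size)]

def pull_blocks(blocks):
    """Stand-in for the transfer: the decode side receives the blocks
    without re-running prefill."""
    return [list(b) for b in blocks]

def decode(blocks, max_new_tokens=3):
    """Stand-in for decode steps that read the transferred cache."""
    token = blocks[-1][-1]
    out = []
    for _ in range(max_new_tokens):
        token = (token * 31 + 7) % 1000   # fake next-token rule
        out.append(token)
    return out

cache = prefill(list(range(10)))   # runs on the prefill worker
pulled = pull_blocks(cache)        # decode worker pulls rather than recomputes
print(f"transferred {len(pulled)} blocks, decoded {decode(pulled)}")
```
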
Tags: Large Language Models (LLMs), Machine Learning, MLOps & Infrastructure, Open Source
