Disaggregated Serving for Hybrid SSM Models in vLLM

vLLM Team · Apr 28, 2026 · 8:33 PM UTC · 13 min read



Opening excerpt (first ~120 words)

Disaggregated Serving for Hybrid SSM Models in vLLM
April 21, 2026 · 15 min read
Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team
#disaggregation #mamba

Table of Contents
Introduction
Background: The NIXL KV Transfer Workflow
The Challenge: FA and SSM State Are Fundamentally Different
The HMA Shared-Tensor Layout
Dual Descriptor Views
Physical vs. Logical Block Sizes
The 3-Descriptors Conv Transfer
The DS Layout Solution
Zero-Overhead: No Extra Buffers, No Permutation
Putting It Together: Nemotron-H Example
Performance
Getting Started
Limitations and Future Work
Acknowledgments

Introduction
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers (such as NVIDIA Nemotron-H) are gaining traction as a way to combine the linear-time efficiency of state-space…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Vercel.
