Disaggregated Serving for Hybrid SSM Models in vLLM

vLLM Team · 13 min read
#machine learning  #model serving  #state-space models  #transformer models  #distributed systems
⚡ TL;DR · AI summary

Hybrid models that interleave Mamba-style SSM layers with full-attention (FA) layers, such as NVIDIA Nemotron-H, are gaining traction because they pair the linear-time efficiency of state-space layers with the expressiveness of attention. vLLM now supports disaggregated prefill/decode serving for these hybrid models by extending its NIXL-based KV connector to handle the fundamentally different FA and SSM state formats. The solution introduces dual descriptor views, bridging between physical and logical block sizes, and a 3-descriptor conv transfer, while leaving the existing workflow for standard transformer models unchanged.

Original article
Vercel · vLLM Team
Opening excerpt (first ~120 words)

Disaggregated Serving for Hybrid SSM Models in vLLM
April 21, 2026 · 15 min read
Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team
#disaggregation  #mamba

Table of Contents
Introduction
Background: The NIXL KV Transfer Workflow
The Challenge: FA and SSM State Are Fundamentally Different
The HMA Shared-Tensor Layout
Dual Descriptor Views
Physical vs. Logical Block Sizes
The 3-Descriptors Conv Transfer
The DS Layout Solution
Zero-Overhead: No Extra Buffers, No Permutation
Putting It Together: Nemotron-H Example
Performance
Getting Started
Limitations and Future Work
Acknowledgments

Introduction

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers, such as NVIDIA Nemotron-H, are gaining traction as a way to combine the linear-time efficiency of state-space…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Vercel.
