
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing


Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
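To make the idea concrete, here is a minimal sketch of the training-time routing the abstract describes: each layer flips a coin and either computes (and caches) its own keys and values or reuses those of a preceding layer. This is a hypothetical illustration, not the authors' implementation; the class name `StochasticKVAttention`, the `share_prob` parameter, and the single-head, unmasked attention are all simplifying assumptions.

```python
import math

import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class StochasticKVAttention:
    """Toy single-head attention layer (no causal mask, for brevity).

    During training, with probability `share_prob` the layer attends to a
    preceding layer's KV states instead of computing its own; at deployment
    the sharing pattern is fixed per layer. Hypothetical sketch only.
    """

    def __init__(self, dim, share_prob=0.5):
        scale = 1.0 / math.sqrt(dim)
        self.Wq = rng.standard_normal((dim, dim)) * scale
        self.Wk = rng.standard_normal((dim, dim)) * scale
        self.Wv = rng.standard_normal((dim, dim)) * scale
        self.share_prob = share_prob

    def forward(self, x, prev_kv=None, training=False, share=False):
        if training and prev_kv is not None:
            # Stochastic routing: randomly reuse the preceding layer's KV.
            share = rng.random() < self.share_prob
        q = x @ self.Wq
        if share and prev_kv is not None:
            k, v = prev_kv                      # reuse: nothing new is cached
        else:
            k, v = x @ self.Wk, x @ self.Wv     # compute and cache own KV
        attn = softmax(q @ k.T / math.sqrt(q.shape[-1]))
        return attn @ v, (k, v)


# Two-layer stack: at deployment, layer 1 shares layer 0's cache,
# so only one layer's KV states need to be stored.
dim, seq = 8, 4
layers = [StochasticKVAttention(dim) for _ in range(2)]
x = rng.standard_normal((seq, dim))
h, kv0 = layers[0].forward(x)
out, kv1 = layers[1].forward(h, prev_kv=kv0, share=True)
assert kv1[0] is kv0[0]  # layer 1 cached no new keys
```

Because the sharing decision is drawn independently at every training step, the model sees many depth-wise sharing patterns and, per the abstract, becomes robust to whichever pattern the deployment hardware budget dictates.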

Original article: arXiv cs.AI
Paper metadata:

Computer Science > Machine Learning · arXiv:2604.22782 (cs) · Submitted 3 Apr 2026
Title: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Authors: Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro


