Still: Amortized KV Cache Compaction in a Single Forward Pass
The paper presents Still, a lightweight per‑layer Perceiver that compacts the KV cache in a single forward pass for long‑horizon language model inference. Trained once against a frozen base model, it shows favorable speed‑quality trade‑offs across 8× to 200× compression and 8k‑128k context lengths on Qwen and Gemma models. On the RULER benchmark it outperforms the strongest baseline by 8–22 points, and it wins a LongBench summarization comparison against KV‑Distill.
- The KV cache is identified as the primary memory bottleneck for deploying long‑horizon language models.
- Existing compaction approaches either lack expressiveness (selection methods) or require per‑context optimization (synthesis methods).
- Still trains a small per‑layer Perceiver once against a frozen base model to generate compact keys and values in a single forward pass.
- Experiments on Qwen and Gemma models show Still achieving favorable speed‑quality results across 8× to 200× compression and 8k‑128k context lengths.
- On the RULER benchmark, Still outperforms the strongest baseline by 8–22 points and wins a LongBench summarization comparison against KV‑Distill.
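
To make the idea of single‑pass compaction more concrete, here is a minimal NumPy sketch of a generic Perceiver‑style cross‑attention step for one layer: a small set of latent vectors attends over the n cached tokens and emits m compact keys and values. This is an illustration under stated assumptions, not the paper's architecture. The class name `PerceiverCompactor`, the single‑head layout, the concatenation of keys and values as context, and all weight shapes are hypothetical, and the weights are random stand‑ins for parameters that would be trained once against a frozen base model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class PerceiverCompactor:
    """Hypothetical single-head compactor for one layer's KV cache.

    m latent vectors cross-attend to n cached (key, value) pairs and
    produce m compact keys and m compact values, with m much smaller than n.
    """

    def __init__(self, head_dim, num_latents, seed=0):
        rng = np.random.default_rng(seed)
        scale = head_dim ** -0.5
        self.head_dim = head_dim
        # Random stand-ins for trained parameters (assumption for illustration).
        self.latents = rng.normal(size=(num_latents, head_dim)) * scale
        self.w_q = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_k = rng.normal(size=(2 * head_dim, head_dim)) * scale
        self.w_v = rng.normal(size=(2 * head_dim, head_dim)) * scale
        self.w_out_k = rng.normal(size=(head_dim, head_dim)) * scale
        self.w_out_v = rng.normal(size=(head_dim, head_dim)) * scale

    def __call__(self, keys, values):
        # keys, values: arrays of shape (n, head_dim) from one layer's cache.
        context = np.concatenate([keys, values], axis=-1)   # (n, 2d)
        q = self.latents @ self.w_q                          # (m, d)
        k = context @ self.w_k                               # (n, d)
        v = context @ self.w_v                               # (n, d)
        attn = softmax(q @ k.T / np.sqrt(self.head_dim))     # (m, n)
        pooled = attn @ v                                    # (m, d)
        # Separate output projections give the compact keys and values.
        return pooled @ self.w_out_k, pooled @ self.w_out_v


# Usage: compact 1024 cached tokens to 128 slots, i.e. 8x compression.
rng = np.random.default_rng(1)
n_tokens, head_dim, ratio = 1024, 64, 8
keys = rng.normal(size=(n_tokens, head_dim))
values = rng.normal(size=(n_tokens, head_dim))

compactor = PerceiverCompactor(head_dim, num_latents=n_tokens // ratio)
compact_k, compact_v = compactor(keys, values)
print(compact_k.shape, compact_v.shape)  # (128, 64) (128, 64)
```

The point of the sketch is the cost profile: compaction is one matrix‑multiply pass over the cache with fixed weights, so nothing is optimized per context at inference time, and the output size is set by the number of latents rather than by the input length.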
Opening excerpt (first ~120 words)
arXiv:2606.07878 (cs), Machine Learning. Submitted on 5 Jun 2026.
Title: Still: Amortized KV Cache Compaction in a Single Forward Pass
Authors: Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby
Abstract: The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.