ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
The paper introduces ThriftAttention, a selective mixed-precision method for attention in long-context workloads. It computes a small fraction of query-key blocks in FP16 and leaves the rest in FP4, aiming to preserve quality while keeping the cost savings of 4-bit attention. It recovers most of the quality gap between FP4-only and FP16 attention, and the benefit grows as sequence length increases.
- ThriftAttention employs a two-stage process to enhance attention computation efficiency.
- By computing only 5% of query-key blocks in FP16, the method recovers an average of 89.1% of the performance gap between FP4 and FP16.
- The technique addresses the quality degradation typically observed in long-context settings.
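The two figures above (a 5% FP16 block budget and a percentage of the FP4-to-FP16 gap recovered) can be made concrete with a little arithmetic. The sketch below is illustrative only and is not the paper's implementation: the function names, the block size, the non-causal block count, and the idea of ranking blocks by a precomputed importance score are all assumptions, since the excerpt does not say how ThriftAttention chooses which blocks to promote. The numeric inputs in the usage lines are made up to exercise the formulas.

```python
import math


def num_blocks(seq_len: int, block_size: int) -> int:
    """Count query-key block pairs in a full (non-causal) attention matrix.

    Assumption: square tiling with the same block size on both axes.
    """
    per_side = math.ceil(seq_len / block_size)
    return per_side * per_side


def select_high_precision_blocks(block_scores: dict, fraction: float = 0.05) -> set:
    """Pick the top `fraction` of blocks to compute in FP16; the rest stay in FP4.

    `block_scores` maps (q_block, k_block) -> importance score. The scoring
    rule is hypothetical; the source does not describe the selection criterion.
    """
    budget = max(1, math.floor(len(block_scores) * fraction))
    ranked = sorted(block_scores, key=block_scores.get, reverse=True)
    return set(ranked[:budget])


def gap_recovered(fp4_score: float, fp16_score: float, mixed_score: float) -> float:
    """Fraction of the FP4-to-FP16 quality gap closed by the mixed-precision run."""
    gap = fp16_score - fp4_score
    if gap == 0:
        raise ValueError("FP4 and FP16 scores are equal; there is no gap to recover")
    return (mixed_score - fp4_score) / gap


# Usage with made-up inputs (not results from the paper).
total = num_blocks(seq_len=8192, block_size=128)   # 64 * 64 block pairs
fp16_budget = math.floor(total * 0.05)             # blocks promoted to FP16

scores = {(q, k): float(q * 10 + k) for q in range(4) for k in range(5)}
promoted = select_high_precision_blocks(scores, fraction=0.05)

recovered = gap_recovered(fp4_score=40.0, fp16_score=60.0, mixed_score=55.0)
print(total, fp16_budget, promoted, recovered)
```

Because the block count grows quadratically with sequence length, a fixed 5% budget still means the absolute number of FP16 blocks grows quadratically; the saving comes from the remaining 95% staying in FP4.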
Opening excerpt (first ~120 words)
Computer Science > Machine Learning
arXiv:2605.23081 (cs) [Submitted on 21 May 2026]
Title: ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention
Authors: Joe Sharratt
Abstract: Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.