Contrastive Decoding Diffing: Recovering Finetuning Data Without Weight Access
Contrastive Decoding Diffing (CDD) is a new method for recovering finetuning data from language models without access to their weights. It reportedly outperforms existing methods, recovering implanted facts verbatim across several model architectures, and is presented as a practical tool for transparency and accountability in AI systems.
- Contrastive Decoding Diffing (CDD) operates on output-level logit distributions without requiring weight access.
- CDD successfully recovers exact drug names, vote counts, and other factual details across multiple model architectures.
- The method also identifies unintended data artifacts, marking a significant advancement in model auditing.
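To make "operates on output-level logit distributions" concrete, here is a minimal sketch of contrastive scoring between a finetuned model and its base model. It illustrates the general contrastive-decoding idea only; the excerpt does not give the paper's actual procedure, so this should not be read as CDD itself. The function names and the `alpha` and `plausibility` parameters are hypothetical, and the logits are toy values standing in for what a model API would return.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(finetuned_logits, base_logits, alpha=1.0, plausibility=0.1):
    """Score each token by how much finetuning raised its log-probability.

    Assumption: both models share a vocabulary, so logits align by index.
    `alpha` weights the base model; `plausibility` drops tokens whose
    finetuned probability is below that fraction of the top token's.
    """
    ft = log_softmax(finetuned_logits)
    base = log_softmax(base_logits)
    cutoff = max(ft) + math.log(plausibility)
    return [
        f - alpha * b if f >= cutoff else float("-inf")
        for f, b in zip(ft, base)
    ]

def pick_token(finetuned_logits, base_logits, **kwargs):
    """Return the index of the token with the highest contrastive score."""
    scores = contrastive_scores(finetuned_logits, base_logits, **kwargs)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 4-token vocabulary: finetuning boosted token 1, but token 0 is
# still the most likely token under both models.
base_logits = [3.0, 1.0, 0.0, -5.0]
finetuned_logits = [3.0, 2.5, 0.0, -5.0]
chosen = pick_token(finetuned_logits, base_logits)
```

In this toy case, greedy decoding from the finetuned model alone would pick token 0, while the contrastive score picks token 1, the token whose probability finetuning shifted most. The plausibility cutoff keeps very unlikely tokens from winning on a large ratio alone.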
Opening excerpt (first ~120 words)
arXiv:2605.25902 (cs), Computer Science > Machine Learning
Submitted on 25 May 2026
Title: Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing
Authors: Michał Brzozowski, Zuzanna Dubanowska, Enrico Cassano, Neo Christopher Chung
Abstract: Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.