Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
The paper discusses Pretraining Data Exposure (PDE) in Large Language Models (LLMs), highlighting its implications for privacy and evaluation integrity. It presents a unified survey of membership inference and data contamination within the PDE framework. The authors formalize PDE, review existing methods, and identify future research directions.
- Large Language Models have become a key focus in natural language processing.
- Pretraining Data Exposure concerns arise as model sizes and datasets grow.
- The paper offers a comprehensive survey of membership inference and data contamination under the PDE framework.
Opening excerpt (first ~120 words)
arXiv:2605.26133 (cs), Computer Science > Computation and Language. Submitted on 21 May 2026.
Authors: Ziyi Tong, Feifei Sun, Le Minh Nguyen
Abstract: Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.