Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

May 27, 2026 · 4:00 AM UTC · 2 min read

#artificial intelligence #machine learning #data privacy

TL;DR · WeSearch summary

The paper discusses Pretraining Data Exposure (PDE) in Large Language Models (LLMs), highlighting its implications for privacy and evaluation integrity. It presents a unified survey of membership inference and data contamination within the PDE framework. The authors formalize PDE, review existing methods, and identify future research directions.

Key facts

▪ Large Language Models have become the predominant paradigm in natural language processing, in both research and industry.
▪ Concerns about Pretraining Data Exposure increase as model sizes and pretraining datasets grow, because of the scale and opacity of those datasets.
▪ The paper offers a unified survey of membership inference and data contamination under the PDE framework.

Original article

arXiv cs.AI

Opening excerpt (first ~120 words)

Computer Science > Computation and Language
arXiv:2605.26133 (cs)
[Submitted on 21 May 2026]

Title: Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Authors: Ziyi Tong, Feifei Sun, Le Minh Nguyen

Abstract: Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
