Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

TL;DR (AI summary)

The paper surveys Pretraining Data Exposure (PDE) in Large Language Models (LLMs) and its implications for privacy and evaluation integrity. It treats membership inference and data contamination together within a single PDE framework. The authors formalize PDE, review existing methods, and identify directions for future research.

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

arXiv:2605.26133 (cs), Computer Science > Computation and Language
Submitted on 21 May 2026
Title: Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Authors: Ziyi Tong, Feifei Sun, Le Minh Nguyen
Abstract: Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
