WeSearch

The Impact of AI-Generated Text on the Internet

Tags: ai-generated text, internet archive, web crawling, text detection, pangram v3
⚡ TL;DR · AI summary

Estimating the volume of AI-generated text online is challenging due to the lack of a central index and biases in web crawls. Researchers used the Internet Archive and a stratified sampling method to analyze web content from 2022 to 2025. They tested four AI detection tools and found Pangram v3 to be the most reliable across various conditions.

Original article: GitHub
Opening excerpt (first ~120 words)

How much new text on the internet is AI-generated? Answering this question is harder than it might seem. Constructing a statistically representative sample of the internet is difficult, as there is no central index, popular domains are vastly over-represented in most crawls, and archival coverage has shifted considerably over time. To work around this, we draw on the Internet Archive's Wayback Machine and apply a multi-dimensional stratified sampling approach, approximating a uniform random draw from publicly accessible web pages published between 2022 and 2025 (see Section 3.1 in our paper). On top of this sample, we need a reliable way to tell AI-generated and AI-assisted text apart from human-written text.

Excerpt limited to ~120 words for fair-use compliance. The full article is on GitHub.
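
To make the sampling idea in the excerpt concrete, here is a minimal sketch of stratified sampling followed by a size-weighted estimate. It is not the authors' implementation: the stratification dimensions (`year` and a domain-popularity `bucket`), the `is_ai` flag, and the per-stratum size estimates are all hypothetical stand-ins, and the paper's Section 3.1 defines the actual procedure. The point is only to show why sampling each stratum separately and then reweighting keeps popular domains from dominating the estimate.

```python
import random
from collections import defaultdict


def stratified_sample(pages, strata_keys, per_stratum, seed=0):
    """Draw up to `per_stratum` pages from each stratum.

    `pages` is a list of dicts. `strata_keys` names the fields whose
    combined values define a stratum (hypothetical fields here).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[tuple(page[k] for k in strata_keys)].append(page)

    sample = {}
    for key, members in strata.items():
        k = min(per_stratum, len(members))
        sample[key] = rng.sample(members, k)  # sampling without replacement
    return sample


def weighted_share(sample, stratum_sizes, flag="is_ai"):
    """Estimate the overall share of flagged pages.

    Each stratum's observed rate is weighted by its (assumed known)
    share of the full population, given in `stratum_sizes`.
    """
    total = sum(stratum_sizes[key] for key in sample)
    estimate = 0.0
    for key, members in sample.items():
        rate = sum(1 for p in members if p[flag]) / len(members)
        estimate += rate * stratum_sizes[key] / total
    return estimate


# Toy population: a large stratum with no AI text, a small one that is all AI.
pages = (
    [{"year": 2022, "bucket": "head", "is_ai": False} for _ in range(100)]
    + [{"year": 2025, "bucket": "tail", "is_ai": True} for _ in range(10)]
)
sample = stratified_sample(pages, ["year", "bucket"], per_stratum=5)
sizes = {(2022, "head"): 100, (2025, "tail"): 10}
share = weighted_share(sample, sizes)
```

In this toy case the unweighted sample is half AI-flagged, because both strata contribute five pages, while the weighted estimate recovers the population share of 10 out of 110. In practice the stratum sizes are themselves estimates, which is one reason the excerpt describes the result as an approximation of a uniform draw.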
