{"ID":2857355,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09031","arxiv_id":"2510.09031","title":"Web Crawler Restrictions, AI Training Datasets \u0026 Political Biases","abstract":"Large language models rely on web-scraped text for training; concurrently, content creators are increasingly blocking AI crawlers to retain control over their data. We analyze crawler restrictions across the top one million most-visited websites since 2023 and examine their potential downstream effects on training data composition. Our analysis reveals growing restrictions, with blocking patterns varying by website popularity and content type. A quarter of the top thousand websites restrict AI crawlers, decreasing to one-tenth across the broader top million. Content type matters significantly: 34.2% of news outlets disallow OpenAI's GPTBot, rising to 55% for outlets with high factual reporting. Additionally, outlets with neutral political positions impose the strongest restrictions (58%), whereas hyperpartisan websites and those with low factual reporting impose fewer restrictions: only 4.1% of right-leaning outlets block access to OpenAI. Our findings suggest that heterogeneous blocking patterns may skew training datasets toward low-quality or polarized content, potentially affecting the capabilities of models served by prominent AI-as-a-Service providers.","short_abstract":"Large language models rely on web-scraped text for training; concurrently, content creators are increasingly blocking AI crawlers to retain control over their data. We analyze crawler restrictions across the top one million most-visited websites since 2023 and examine their potential downstream effects on training data...","url_abs":"https://arxiv.org/abs/2510.09031","url_pdf":"https://arxiv.org/pdf/2510.09031v1","authors":"[\"Paul Bouchaud\",\"Pedro Ramaciotti\"]","published":"2025-10-10T06:06:05Z","proceeding":"cs.SI","tasks":"[\"cs.SI\"]","methods":"[\"Language Model\"]","has_code":false}
