{"ID":2866095,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21269","arxiv_id":"2509.21269","title":"LLMTrace: A Corpus for Classification and Fine-Grained Localization of AI-Written Text","abstract":"The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the increasingly common scenario of mixed human-AI authorship. Crucially, while some datasets address mixed authorship, none provide the character-level annotations required for the precise localization of AI-generated segments within a text. To address these gaps, we introduce LLMTrace, a new large-scale, bilingual (English and Russian) corpus for AI-generated text detection. Constructed using a diverse range of modern proprietary and open-source LLMs, our dataset is designed to support two key tasks: traditional full-text binary classification (human vs. AI) and the novel task of AI-generated interval detection, facilitated by character-level annotations. We believe LLMTrace will serve as a vital resource for training and evaluating the next generation of more nuanced and practical AI detection models. The project page is available at https://sweetdream779.github.io/LLMTrace-info/ (iitolstykh/LLMTrace).","short_abstract":"The widespread use of human-like text from Large Language Models (LLMs) necessitates the development of robust detection systems. However, progress is limited by a critical lack of suitable training data; existing datasets are often generated with outdated models, are predominantly in English, and fail to address the i...","url_abs":"https://arxiv.org/abs/2509.21269","url_pdf":"https://arxiv.org/pdf/2509.21269v1","authors":"[\"Irina Tolstykh\",\"Aleksandra Tsybina\",\"Sergey Yakubson\",\"Maksim Kuprashevich\"]","published":"2025-09-25T14:59:43Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
