{"ID":3053224,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-05T19:35:40.366641076Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04177","arxiv_id":"2606.04177","title":"A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models","abstract":"Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this gap, we conduct a large-scale empirical study assessing the robustness of linguistic signals for characterizing AI-generated text. Our analysis covers 284 interpretable linguistic features across outputs from 27 LLMs and ten text domains under cross-model and cross-domain generalization settings. We show that classifiers based solely on linguistic features can reliably distinguish AI-generated from human-written text. However, many previously proposed indicators prove strongly context-dependent, with the exception of measures of lexical richness, which remain robust signals across model families and text domains. These results demonstrate which linguistic signals generalize across contexts and provide a foundation for more reliable, interpretable analyses of AI-generated language.","short_abstract":"Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for non-expert users. However, existing findings on which features reliably indicate LLM-generated text remain fragmented across feature sets, models, and text domains. To address this ga...","url_abs":"https://arxiv.org/abs/2606.04177","url_pdf":"https://arxiv.org/pdf/2606.04177v1","authors":"[\"Yassir El Attar\",\"Esra Dönmez\",\"Maximilian Maurer\",\"Agnieszka Falenska\"]","published":"2026-06-02T19:46:22Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
