{"ID":2862574,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25868","arxiv_id":"2509.25868","title":"ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations","abstract":"The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit's r/AskScience. Evaluating 9 state-of-the-art LLMs reveals two critical limitations. First, models exhibit a dominant \"salient distractor\" failure mode: 61% of incorrect span predictions are semantically unrelated to actual errors. Crucially, this pattern persists across all model scales (1B to 70B), indicating a fundamental semantic grounding deficit that scaling alone fails to resolve. Second, we find that comparative judgment is paradoxically harder than independent detection, even GPT-4o's F1 score drops from 0.67 to 0.53 when comparing answers side-by-side. These findings directly challenge the reliability of LLM-as-Judge paradigms for scientific factuality. Code and data are released at https://github.com/ddz5431/ReFACT.","short_abstract":"The mechanisms underlying scientific confabulation in Large Language Models (LLMs) remain poorly understood. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs with span-level error annotations derived from Reddit's r/AskScience. Evaluating 9 state-of-the-a...","url_abs":"https://arxiv.org/abs/2509.25868","url_pdf":"https://arxiv.org/pdf/2509.25868v3","authors":"[\"Yindong Wang\",\"Martin Preiß\",\"Margarita Bugueño\",\"Jan Vincent Hoffbauer\",\"Abdullatif Ghajar\",\"Tolga Buz\",\"Gerard de Melo\"]","published":"2025-09-30T07:06:23Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608910,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2862574,"paper_url":"https://arxiv.org/abs/2509.25868","paper_title":"ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations","repo_url":"https://github.com/ddz5431/ReFACT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
