{"ID":2838206,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.17946","arxiv_id":"2511.17946","title":"Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models","abstract":"Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. Prior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucination is underexplored. Existing studies show that LLMs underperform on long-tail knowledge, i.e., the accuracy of the generated answer drops for the ground-truth entities that are rare in pretraining. However, examining whether data coverage itself can serve as a detection signal is overlooked. We propose a complementary question: Does lexical training-data coverage of the question and/or generated answer provide additional signal for hallucination detection? To investigate this, we construct scalable suffix arrays over RedPajama's 1.3-trillion-token pretraining corpus to retrieve $n$-gram statistics for both prompts and model generations. We evaluate their effectiveness for hallucination detection across three QA benchmarks. Our observations show that while occurrence-based features are weak predictors when used alone, they yield modest gains when combined with log-probabilities, particularly on datasets with higher intrinsic model uncertainty. These findings suggest that lexical coverage features provide a complementary signal for hallucination detection. All code and suffix-array infrastructure are provided at https://github.com/WWWonderer/ostd.","short_abstract":"Hallucination in large language models (LLMs) is a fundamental challenge, particularly in open-domain question answering. \nPrior work attempts to detect hallucination with model-internal signals such as token-level entropy or generation consistency, while the connection between pretraining data exposure and hallucinatio...","url_abs":"https://arxiv.org/abs/2511.17946","url_pdf":"https://arxiv.org/pdf/2511.17946v1","authors":"[\"Shuo Zhang\",\"Fabrizio Gotti\",\"Fengran Mo\",\"Jian-Yun Nie\"]","published":"2025-11-22T06:59:55Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606745,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2838206,"paper_url":"https://arxiv.org/abs/2511.17946","paper_title":"Measuring the Impact of Lexical Training Data Coverage on Hallucination Detection in Large Language Models","repo_url":"https://github.com/WWWonderer/ostd","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
