{"ID":2882833,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09848","arxiv_id":"2508.09848","title":"PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts","abstract":"We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG and in-domain training with state-of-the-art LLMs, and commercial DeepResearch services, lag behind humans by \u003e15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.","short_abstract":"We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the...","url_abs":"https://arxiv.org/abs/2508.09848","url_pdf":"https://arxiv.org/pdf/2508.09848v2","authors":"[\"Mo Yu\",\"Tsz Ting Chung\",\"Chulun Zhou\",\"Tong Li\",\"Rui Lu\",\"Jiangnan Li\",\"Liyan Xu\",\"Haoshu Lu\",\"Ning Zhang\",\"Jing Li\",\"Jie Zhou\"]","published":"2025-08-13T14:28:25Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}