{"ID":2878250,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.19363","arxiv_id":"2508.19363","title":"LongReasonArena: A Long Reasoning Benchmark for Large Language Models","abstract":"Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, Deepseek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that the accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data is available at https://github.com/LongReasonArena/LongReasonArena.","short_abstract":"Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our ta...","url_abs":"https://arxiv.org/abs/2508.19363","url_pdf":"https://arxiv.org/pdf/2508.19363v1","authors":"[\"Jiayu Ding\",\"Shuming Ma\",\"Lei Cui\",\"Nanning Zheng\",\"Furu Wei\"]","published":"2025-08-26T18:41:53Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610469,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2878250,"paper_url":"https://arxiv.org/abs/2508.19363","paper_title":"LongReasonArena: A Long Reasoning Benchmark for Large Language Models","repo_url":"https://github.com/LongReasonArena/LongReasonArena","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}