{"ID":2864940,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21766","arxiv_id":"2509.21766","title":"UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios","abstract":"Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce \\textbf{UltraHorizon} a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average \\textbf{200k+} tokens and \\textbf{400+} tool calls, whereas in standard configurations they still exceed \\textbf{35k} tokens and involve more than \\textbf{60} tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. \\href{https://github.com/StarDewXXX/UltraHorizon}{Our code will be available here.}","short_abstract":"Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and par...","url_abs":"https://arxiv.org/abs/2509.21766","url_pdf":"https://arxiv.org/pdf/2509.21766v1","authors":"[\"Haotian Luo\",\"Huaisong Zhang\",\"Xuelin Zhang\",\"Haoyu Wang\",\"Zeyu Qin\",\"Wenjie Lu\",\"Guozheng Ma\",\"Haiying He\",\"Yingsha Xie\",\"Qiyang Zhou\",\"Zixuan Hu\",\"Hongze Mi\",\"Yibo Wang\",\"Naiqiang Tan\",\"Hong Chen\",\"Yi R. Fung\",\"Chun Yuan\",\"Li Shen\"]","published":"2025-09-26T02:04:00Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":609215,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864940,"paper_url":"https://arxiv.org/abs/2509.21766","paper_title":"UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios","repo_url":"https://github.com/StarDewXXX/UltraHorizon","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}