{"ID":2856040,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10909","arxiv_id":"2510.10909","title":"PaperArena: An Evaluation Benchmark for Tool-Augmented Agentic Reasoning on Scientific Literature","abstract":"Understanding and reasoning on the large-scale scientific literature is a crucial touchstone for large language model (LLM) based agents. However, existing works are mainly restricted to tool-free tasks within single papers, largely due to the lack of a benchmark that evaluates cross-paper reasoning and multi-tool orchestration in authentic research scenarios. In this work, we propose PaperArena, a benchmark to evaluate LLM-based agents on questions that require integrating information across multiple papers with the assistance of external tools. Given a research question, agents should formulate a reasoning plan, interact with multiple papers, and invoke appropriate tools to produce a well-grounded answer. To support standardized evaluation, we provide a platform for agent execution, offering a modular tool environment including multimodal parsing, context retrieval, and programmatic computation. Experiments reveal that even the leading LLM powering a well-established agentic workflow achieves merely 38.78% average accuracy, while on the hard subset, accuracy drops to only 18.47%. We also analyze reasoning traces and diagnose agent behavior, providing the community with insights to develop and evaluate more capable scientific agents.","short_abstract":"Understanding and reasoning on the large-scale scientific literature is a crucial touchstone for large language model (LLM) based agents. However, existing works are mainly restricted to tool-free tasks within single papers, largely due to the lack of a benchmark that evaluates cross-paper reasoning and multi-tool orch...","url_abs":"https://arxiv.org/abs/2510.10909","url_pdf":"https://arxiv.org/pdf/2510.10909v4","authors":"[\"Daoyu Wang\",\"Mingyue Cheng\",\"Shuo Yu\",\"Zirui Liu\",\"Ze Guo\",\"Xin Li\",\"Qi Liu\"]","published":"2025-10-13T02:10:39Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
