{"ID":2881748,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.11133","arxiv_id":"2508.11133","title":"MoNaCo: More Natural and Complex Questions for Reasoning Across Dozens of Documents","abstract":"Automated agents, powered by Large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. To address this gap we introduce MoNaCo, a benchmark of 1,315 natural and time-consuming questions that require dozens, and at times hundreds, of intermediate steps to solve -- far more than any existing QA benchmark. To build MoNaCo, we developed a decomposed annotation pipeline to elicit and manually answer real-world time-consuming questions at scale. Frontier LLMs evaluated on MoNaCo achieve at most 61.2% F1, hampered by low recall and hallucinations. Our results underscore the limitations of LLM-powered agents in handling the complexity and sheer breadth of real-world information-seeking tasks -- with MoNaCo providing an effective resource for tracking such progress. The MoNaCo benchmark, codebase, prompts and models predictions are all publicly available at: https://tomerwolgithub.github.io/monaco","short_abstract":"Automated agents, powered by Large language models (LLMs), are emerging as the go-to tool for querying information. However, evaluation benchmarks for LLM agents rarely feature natural questions that are both information-seeking and genuinely time-consuming for humans. \nTo address this gap we introduce MoNaCo, a benchma...","url_abs":"https://arxiv.org/abs/2508.11133","url_pdf":"https://arxiv.org/pdf/2508.11133v2","authors":"[\"Tomer Wolfson\",\"Harsh Trivedi\",\"Mor Geva\",\"Yoav Goldberg\",\"Dan Roth\",\"Tushar Khot\",\"Ashish Sabharwal\",\"Reut Tsarfaty\"]","published":"2025-08-15T00:58:10Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.DB\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
