{"ID":2877730,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20033","arxiv_id":"2508.20033","title":"DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis","abstract":"The ability to research and synthesize knowledge is central to human expertise and progress. A new class of AI systems--designed for generative research synthesis--aims to automate this process by retrieving information from the live web and producing long-form, cited reports. Yet, evaluating such systems remains an open challenge: existing question-answering benchmarks focus on short, factual answers, while expert-curated datasets risk staleness and data contamination. Neither captures the complexity and evolving nature of real research synthesis tasks. We introduce DeepScholar-bench, a live benchmark and automated evaluation framework for generative research synthesis. DeepScholar-bench draws queries and human-written exemplars from recent, high-quality ArXiv papers and evaluates a real synthesis task: generating a related work section by retrieving, synthesizing, and citing prior work. Our automated framework holistically measures performance across three key dimensions--knowledge synthesis, retrieval quality, and verifiability. To further future work, we also contribute DeepScholar-ref, a simple, open-source reference pipeline, which is implemented on the LOTUS framework and provides a strong baseline. Using DeepScholar-bench, we systematically evaluate prior open-source systems, search agents with strong models, OpenAI's DeepResearch, and DeepScholar-ref. We find DeepScholar-bench is far from saturated: no system surpasses a geometric mean of $31\\%$ across all metrics. These results highlight both the difficulty and importance of DeepScholar-bench as a foundation for advancing AI systems capable of generative research synthesis. We make our benchmark code and data available at https://github.com/guestrin-lab/deepscholar-bench.","short_abstract":"The ability to research and synthesize knowledge is central to human expertise and progress. A new class of AI systems--designed for generative research synthesis--aims to automate this process by retrieving information from the live web and producing long-form, cited reports. Yet, evaluating such systems remains an op...","url_abs":"https://arxiv.org/abs/2508.20033","url_pdf":"https://arxiv.org/pdf/2508.20033v2","authors":"[\"Liana Patel\",\"Negar Arabzadeh\",\"Harshit Gupta\",\"Ankita Sundar\",\"Ion Stoica\",\"Matei Zaharia\",\"Carlos Guestrin\"]","published":"2025-08-27T16:36:34Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":610415,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877730,"paper_url":"https://arxiv.org/abs/2508.20033","paper_title":"DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis","repo_url":"https://github.com/guestrin-lab/deepscholar-bench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}