{"ID":2862752,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.26184","arxiv_id":"2509.26184","title":"Auto-ARGUE: LLM-Based Report Generation Evaluation","abstract":"Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for report generation evaluation. We present analysis of Auto-ARGUE on the report generation pilot task from the TREC 2024 NeuCLIR track and on two tasks from the TREC 2024 RAG track, showing good system-level correlations with human judgments. Additionally, we release ARGUE-Viz, a web app for visualization and fine-grained analysis of Auto-ARGUE judgments and scores.","short_abstract":"Generation of citation-backed reports is a primary use case for retrieval-augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, tools designed for report generation are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed...","url_abs":"https://arxiv.org/abs/2509.26184","url_pdf":"https://arxiv.org/pdf/2509.26184v5","authors":"[\"William Walden\",\"Marc Mason\",\"Orion Weller\",\"Laura Dietz\",\"John Conroy\",\"Neil Molino\",\"Hannah Recknor\",\"Bryan Li\",\"Gabrielle Kaili-May Liu\",\"Yu Hou\",\"Dawn Lawrie\",\"James Mayfield\",\"Eugene Yang\"]","published":"2025-09-30T12:41:11Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.AI\",\"cs.CL\"]","methods":"[\"RAG\",\"Large Language Model\"]","has_code":false}