{"ID":2873676,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.05878","arxiv_id":"2509.05878","title":"MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries","abstract":"Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an \"LLM Jury\"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P \u003c 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.","short_abstract":"Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for...","url_abs":"https://arxiv.org/abs/2509.05878","url_pdf":"https://arxiv.org/pdf/2509.05878v1","authors":"[\"François Grolleau\",\"Emily Alsentzer\",\"Timothy Keyes\",\"Philip Chung\",\"Akshay Swaminathan\",\"Asad Aali\",\"Jason Hom\",\"Tridu Huynh\",\"Thomas Lew\",\"April S. Liang\",\"Weihan Chu\",\"Natasha Z. Steele\",\"Christina F. Lin\",\"Jingkun Yang\",\"Kameron C. Black\",\"Stephen P. Ma\",\"Fateme N. Haredasht\",\"Nigam H. Shah\",\"Kevin Schulman\",\"Jonathan H. Chen\"]","published":"2025-09-07T00:41:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
