{"ID":3084684,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T20:54:36.964885582Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05436","arxiv_id":"2606.05436","title":"Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison","abstract":"Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.","short_abstract":"Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language...","url_abs":"https://arxiv.org/abs/2606.05436","url_pdf":"https://arxiv.org/pdf/2606.05436v1","authors":"[\"Alejandro Lozano\",\"Keiko Ihara\",\"Ping-Hao Yang\",\"Carrie E. Robertson\",\"Jennifer Stern\",\"Allan Purdy\",\"Hsiangkuo Yuan\",\"Pengfei Zhang\",\"Yulia Orlova\",\"Olga Fermo\",\"Jennifer Hranilovich\",\"Fred Cohen\",\"Todd J. Schwedt\",\"Jenelle A. Jindal\",\"Serena Yeung-Levy\",\"Chia-Chun Chiang\"]","published":"2026-06-03T20:58:43Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.IR\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
