{"ID":2856656,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10415","arxiv_id":"2510.10415","title":"CQA-Eval: Designing Reliable Evaluations of Multi-paragraph Clinical QA under Resource Constraints","abstract":"Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and risk disclosure. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on communicates-risks remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.","short_abstract":"Evaluating multi-paragraph clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over multi-paragraph text is difficult. We introduce CQA-Eval, an evaluation framework and set of evaluation recommendations fo...","url_abs":"https://arxiv.org/abs/2510.10415","url_pdf":"https://arxiv.org/pdf/2510.10415v3","authors":"[\"Federica Bologna\",\"Tiffany Pan\",\"Matthew Wilkens\",\"Yue Guo\",\"Lucy Lu Wang\"]","published":"2025-10-12T02:49:04Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}