{"ID":2898918,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03152","arxiv_id":"2507.03152","title":"MedVAL: Toward Expert-Level Medical Text Validation with Language Models","abstract":"With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the \"LLM-as-a-judge\" paradigm offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. We propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source and proprietary models, MedVAL distillation significantly improves (p \u003c 0.001) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, demonstrating a performance statistically non-inferior to a single human expert on a subset annotated by multiple physicians (p \u003c 0.001). To support a scalable, risk-aware pathway towards clinical integration, we open-source: 1) Codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.","short_abstract":"With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review...","url_abs":"https://arxiv.org/abs/2507.03152","url_pdf":"https://arxiv.org/pdf/2507.03152v5","authors":"[\"Asad Aali\",\"Vasiliki Bikia\",\"Maya Varma\",\"Nicole Chiou\",\"Sophie Ostmeier\",\"Arnav Singhvi\",\"Magdalini Paschali\",\"Ashwin Kumar\",\"Andrew Johnston\",\"Karimar Amador-Martinez\",\"Eduardo Juan Perez Guerrero\",\"Paola Naovi Cruz Rivera\",\"Sergios Gatidis\",\"Christian Bluethgen\",\"Eduardo Pontes Reis\",\"Eddy D. Zandee van Rilland\",\"Poonam Laxmappa Hosamani\",\"Kevin R Keet\",\"Minjoung Go\",\"Evelyn Ling\",\"David B. Larson\",\"Curtis Langlotz\",\"Roxana Daneshjou\",\"Jason Hom\",\"Sanmi Koyejo\",\"Emily Alsentzer\",\"Akshay S. Chaudhari\"]","published":"2025-07-03T20:19:18Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612438,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2898918,"paper_url":"https://arxiv.org/abs/2507.03152","paper_title":"MedVAL: Toward Expert-Level Medical Text Validation with Language Models","repo_url":"https://github.com/StanfordMIMI/MedVAL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
