{"ID":2835055,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00271","arxiv_id":"2512.00271","title":"Comparative Evaluation of Generative AI Models for Chest Radiograph Report Generation in the Emergency Department","abstract":"Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory symptoms. Reports from five VLMs (AIRead, Lingshu, MAIRA-2, MedGemma, and MedVersa) and radiologist-written reports were randomly presented and blindly evaluated by three thoracic radiologists using four criteria: RADPEER, clinical acceptability, hallucination, and language clarity. Comparative performance was assessed using generalized linear mixed models, with radiologist-written reports treated as the reference. Finding-level analyses were also performed with CT as the reference. Results: A total of 478 patients (median age, 67 years [interquartile range, 50-78]; 282 men [59.0%]) were included. AIRead demonstrated the lowest RADPEER 3b rate (5.3% [76/1434] vs. radiologists 13.9% [200/1434]; P\u003c.001), whereas other VLMs showed higher disagreement rates (16.8-43.0%; P\u003c.05). Clinical acceptability was the highest with AIRead (84.5% [1212/1434] vs. radiologists 74.3% [1065/1434]; P\u003c.001), while other VLMs performed worse (41.1-71.4%; P\u003c.05). Hallucinations were rare with AIRead, comparable to radiologists (0.3% [4/1425] vs. 0.1% [1/1425]; P=.21), but frequent with other models (5.4-17.4%; P\u003c.05). Language clarity was higher with AIRead (82.9% [1189/1434]), Lingshu (88.0% [1262/1434]), and MedVersa (88.4% [1268/1434]) compared with radiologists (78.1% [1120/1434]; P\u003c.05). 
Sensitivity varied substantially across VLMs for the common findings: AIRead, 15.5-86.7%; Lingshu, 2.4-86.7%; MAIRA-2, 6.0-72.0%; MedGemma, 4.8-76.7%; and MedVersa, 20.2-69.3%. Conclusion: Medical VLMs for CXR report generation exhibited variable performance in report quality and diagnostic measures.","short_abstract":"Purpose: To benchmark open-source or commercial medical image-specific VLMs against real-world radiologist-written reports. Methods: This retrospective study included adult patients who presented to the emergency department between January 2022 and April 2025 and underwent same-day CXR and CT for febrile or respiratory...","url_abs":"https://arxiv.org/abs/2512.00271","url_pdf":"https://arxiv.org/pdf/2512.00271v1","authors":"[\"Woo Hyeon Lim\",\"Ji Young Lee\",\"Jong Hyuk Lee\",\"Saehoon Kim\",\"Hyungjin Kim\"]","published":"2025-11-29T01:45:55Z","proceeding":"eess.IV","tasks":"[\"eess.IV\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}
