{"ID":2888102,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01016","arxiv_id":"2508.01016","title":"Diagnostic Accuracy of Open-Source Vision-Language Models on Diverse Medical Imaging Tasks","abstract":"This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion identification), neonatal jaundice assessment (skin color-based treatment necessity), and retinal fundoscopy (5-point diabetic retinopathy grading). Diagnostic accuracy was compared in three experimental settings: visual input only, multimodal input, and chain-of-thought reasoning. Model accuracy was assessed against ground truth labels, with statistical comparisons using bootstrapped confidence intervals (p\u003c.05). Qwen2.5 achieved the highest accuracy for chest radiographs (90.4%) and endoscopy images (84.2%), significantly outperforming the other models (p\u003c.001). In colon pathology, Qwen2.5 (69.0%) and Phi-4 (69.6%) performed comparably (p=.41), both significantly exceeding other VLMs (p\u003c.001). Similarly, for neonatal jaundice assessment, Qwen2.5 (58.3%) and Phi-4 (58.1%) showed comparable leading accuracies (p=.93) significantly exceeding their counterparts (p\u003c.001). All models struggled with retinal fundoscopy; Qwen2.5 and Gemma3 achieved the highest, albeit modest, accuracies at 18.6% (comparable, p=.99), significantly better than other tested models (p\u003c.001). Unexpectedly, multimodal input reduced accuracy for some models and modalities, and chain-of-thought reasoning prompts also failed to improve accuracy. The open-source VLMs demonstrated promising diagnostic capabilities, particularly in chest radiograph interpretation. However, performance in complex domains such as retinal fundoscopy was limited, underscoring the need for further development and domain-specific adaptation before widespread clinical application.","short_abstract":"This retrospective study evaluated five VLMs (Qwen2.5, Phi-4, Gemma3, Llama3.2, and Mistral3.1) using the MedFMC dataset. This dataset includes 22,349 images from 7,461 patients encompassing chest radiography (19 disease multi-label classifications), colon pathology (tumor detection), endoscopy (colorectal lesion ident...","url_abs":"https://arxiv.org/abs/2508.01016","url_pdf":"https://arxiv.org/pdf/2508.01016v1","authors":"[\"Gustav Müller-Franzes\",\"Debora Jutz\",\"Jakob Nikolas Kather\",\"Christiane Kuhl\",\"Sven Nebelung\",\"Daniel Truhn\"]","published":"2025-08-01T18:28:37Z","proceeding":"eess.IV","tasks":"[\"eess.IV\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}