{"ID":2869365,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14977","arxiv_id":"2509.14977","title":"EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence","abstract":"Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.","short_abstract":"Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diag...","url_abs":"https://arxiv.org/abs/2509.14977","url_pdf":"https://arxiv.org/pdf/2509.14977v2","authors":"[\"Chaoyin She\",\"Ruifang Lu\",\"Lida Chen\",\"Wei Wang\",\"Qinghua Huang\"]","published":"2025-09-18T14:07:53Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Mixture of Experts\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":609676,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2869365,"paper_url":"https://arxiv.org/abs/2509.14977","paper_title":"EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence","repo_url":"https://github.com/Asunatan/EchoVLM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
