{"ID":2846578,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.01357","arxiv_id":"2511.01357","title":"CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering","abstract":"Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. Treating this task as a simple classification problem may make it unable to adapt to the diversity of free-form answers and overlook the detailed semantic information of free-form answers. In order to tackle these challenges, we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL) framework that learns cross-modal feature representations from images and texts. CMI-MTL comprises three key modules: fine-grained visual-text feature alignment (FVTA), cross-modal interleaved feature representation (CIFR), and free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most relevant regions in image-text pairs through fine-grained visual-text feature alignment. CIFR captures cross-modal sequential interactions via cross-modal interleaved feature representation. FFAE leverages auxiliary knowledge from open-ended questions through free-form answer-enhanced multi-task learning, improving the model's capability for open-ended Med-VQA. Experimental results show that CMI-MTL outperforms the existing state-of-the-art methods on three Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct more interpretability experiments to prove the effectiveness. The code is publicly available at https://github.com/BioMedIA-repo/CMI-MTL.","short_abstract":"Medical visual question answering (Med-VQA) is a crucial multimodal task in clinical decision support and telemedicine. Recent self-attention based methods struggle to effectively handle cross-modal semantic alignments between vision and language. Moreover, classification-based methods rely on predefined answer sets. T...","url_abs":"https://arxiv.org/abs/2511.01357","url_pdf":"https://arxiv.org/pdf/2511.01357v1","authors":"[\"Qiangguo Jin\",\"Xianyao Zheng\",\"Hui Cui\",\"Changming Sun\",\"Yuqi Fang\",\"Cong Cong\",\"Ran Su\",\"Leyi Wei\",\"Ping Xuan\",\"Junbo Wang\"]","published":"2025-11-03T09:05:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":607443,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2846578,"paper_url":"https://arxiv.org/abs/2511.01357","paper_title":"CMI-MTL: Cross-Mamba interaction based multi-task learning for medical visual question answering","repo_url":"https://github.com/BioMedIA-repo/CMI-MTL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
