{"ID":2888215,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03733","arxiv_id":"2508.03733","title":"CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning","abstract":"Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on \"one-time\" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved \"think-answer\" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. 
Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.","short_abstract":"Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency...","url_abs":"https://arxiv.org/abs/2508.03733","url_pdf":"https://arxiv.org/pdf/2508.03733v1","authors":"[\"Wenjie Li\",\"Yujie Zhang\",\"Haoran Sun\",\"Yueqi Li\",\"Fanrui Zhang\",\"Mengzhe Xu\",\"Victoria Borja Clausich\",\"Sade Mellin\",\"Renhao Yang\",\"Chenrun Wang\",\"Jethro Zih-Shuo Wang\",\"Shiyi Yao\",\"Gen Li\",\"Yidong Xu\",\"Hanyu Wang\",\"Yilin Huang\",\"Angela Lin Wang\",\"Chen Shi\",\"Yin Zhang\",\"Jianan Guo\",\"Luqi Yang\",\"Renxuan Li\",\"Yang Xu\",\"Jiawei Liu\",\"Yao Zhang\",\"Lei Liu\",\"Carlos Gutiérrez SanRomán\",\"Lei Wang\"]","published":"2025-07-31T05:07:18Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}
