{"ID":2870495,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.13175","arxiv_id":"2509.13175","title":"More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era","abstract":"The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (\u003e96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale \"silver-standard\" datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that vision encoders trained on this \"silver-standard\" dataset achieve performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we then show that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. 
Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.","short_abstract":"The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs...","url_abs":"https://arxiv.org/abs/2509.13175","url_pdf":"https://arxiv.org/pdf/2509.13175v1","authors":"[\"Yingtai Li\",\"Haoran Lai\",\"Xiaoqian Zhou\",\"Shuai Ming\",\"Wenxin Ma\",\"Wei Wei\",\"Shaohua Kevin Zhou\"]","published":"2025-09-16T15:27:14Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609773,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2870495,"paper_url":"https://arxiv.org/abs/2509.13175","paper_title":"More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era","repo_url":"https://github.com/SadVoxel/More-performant-and-scalable","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}