{"ID":2884827,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06434","arxiv_id":"2508.06434","title":"CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment","abstract":"Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. These properties pose a common challenge for contrastive language-image pretraining (CLIP): they hinder the model's ability to learn robust and generalizable representations. In this work, we propose CLIPin, a unified non-contrastive plug-in that can be seamlessly integrated into CLIP-style architectures to improve multimodal semantic alignment, providing stronger supervision and enhancing alignment robustness. Furthermore, two shared pre-projectors are designed for image and text modalities respectively to facilitate the integration of contrastive and non-contrastive learning in a parameter-compromise manner. Extensive experiments on diverse downstream tasks demonstrate the effectiveness and generality of CLIPin as a plug-and-play component compatible with various contrastive frameworks. Code is available at https://github.com/T6Yang/CLIPin.","short_abstract":"Large-scale natural image-text datasets, especially those automatically collected from the web, often suffer from loose semantic alignment due to weak supervision, while medical datasets tend to have high cross-modal correlation but low content diversity. \nThese properties pose a common challenge for contrastive languag...","url_abs":"https://arxiv.org/abs/2508.06434","url_pdf":"https://arxiv.org/pdf/2508.06434v2","authors":"[\"Shengzhu Yang\",\"Jiawei Du\",\"Shuai Lu\",\"Weihang Zhang\",\"Ningli Wang\",\"Huiqi Li\"]","published":"2025-08-08T16:23:05Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":611123,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2884827,"paper_url":"https://arxiv.org/abs/2508.06434","paper_title":"CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment","repo_url":"https://github.com/T6Yang/CLIPin","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}