{"ID":2868635,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15642","arxiv_id":"2509.15642","title":"UNIV: Unified Foundation Model for Infrared and Visible Modalities","abstract":"Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superficial sensor patterns over underlying semantics. To address this problem, we introduce UNIV, a Unified foundation model for Infrared and Visible modalities. At the core of UNIV lies Patch Cross-modal Contrastive Learning (PCCL), a self-supervised contrastive learning strategy that constructs a unified cross-modal feature space. PCCL employs a frozen pre-trained model to sample pseudo patch pairs based on semantic similarity, and aligns infrared-visible representations by attracting semantically related pairs while repelling unrelated ones. This process simultaneously enhances cross-modal alignment and inter-class semantic separability, guiding the model to focus on semantic structure rather than falling into pattern shortcuts. To further enable cross-modal learning, we introduce MVIP, the most comprehensive visible-infrared benchmark to date, containing 98,992 precisely aligned image pairs across diverse scenes. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU for semantic segmentation and +0.7 mAP for detection), while maintaining competitive accuracy on RGB tasks.","short_abstract":"Joint RGB-infrared perception is essential for achieving robustness under diverse weather and illumination conditions. Although foundation models excel within single modalities, they suffer from substantial cross-modal degradation, an issue we attribute to a pattern shortcut, i.e., a modal bias that prioritizes superfi...","url_abs":"https://arxiv.org/abs/2509.15642","url_pdf":"https://arxiv.org/pdf/2509.15642v3","authors":"[\"Fangyuan Mao\",\"Shuo Wang\",\"Jilin Mei\",\"Shun Lu\",\"Chen Min\",\"Fuyang Liu\",\"Xiaokun Feng\",\"Meiqi Wu\",\"Yu Hu\"]","published":"2025-09-19T06:07:53Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
