{"ID":2856048,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10921","arxiv_id":"2510.10921","title":"FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model","abstract":"Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.","short_abstract":"Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained detail...","url_abs":"https://arxiv.org/abs/2510.10921","url_pdf":"https://arxiv.org/pdf/2510.10921v3","authors":"[\"Chunyu Xie\",\"Bin Wang\",\"Fanjing Kong\",\"Jincheng Li\",\"Dawei Liang\",\"Ji Ao\",\"Dawei Leng\",\"Yuhui Yin\"]","published":"2025-10-13T02:32:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}