{"ID":2890229,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20025","arxiv_id":"2507.20025","title":"Region-based Cluster Discrimination for Visual Representation Learning","abstract":"Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.","short_abstract":"Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effec...","url_abs":"https://arxiv.org/abs/2507.20025","url_pdf":"https://arxiv.org/pdf/2507.20025v1","authors":"[\"Yin Xie\",\"Kaicheng Yang\",\"Xiang An\",\"Kun Wu\",\"Yongle Zhao\",\"Weimo Deng\",\"Zimin Ran\",\"Yumeng Wang\",\"Ziyong Feng\",\"Roy Miles\",\"Ismail Elezi\",\"Jiankang Deng\"]","published":"2025-07-26T17:47:09Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611757,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2890229,"paper_url":"https://arxiv.org/abs/2507.20025","paper_title":"Region-based Cluster Discrimination for Visual Representation Learning","repo_url":"https://github.com/deepglint/MVT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}