{"ID":2889348,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.22062","arxiv_id":"2507.22062","title":"Meta CLIP 2: A Worldwide Scaling Recipe","abstract":"Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting applications from zero-shot classification and retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the \"curse of multilinguality\" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.","short_abstract":"Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting applications from zero-shot classification and retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to le...","url_abs":"https://arxiv.org/abs/2507.22062","url_pdf":"https://arxiv.org/pdf/2507.22062v3","authors":"[\"Yung-Sung Chuang\",\"Yang Li\",\"Dong Wang\",\"Ching-Feng Yeh\",\"Kehan Lyu\",\"Ramya Raghavendra\",\"James Glass\",\"Lifei Huang\",\"Jason Weston\",\"Luke Zettlemoyer\",\"Xinlei Chen\",\"Zhuang Liu\",\"Saining Xie\",\"Wen-tau Yih\",\"Shang-Wen Li\",\"Hu Xu\"]","published":"2025-07-29T17:59:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
