{"ID":2877140,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20691","arxiv_id":"2508.20691","title":"MobileCLIP2: Improving Multi-Modal Reinforced Training","abstract":"Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2$\\times$ smaller and improves on DFN ViT-L/14 at 2.5$\\times$ lower latency. We release our pretrained models (https://github.com/apple/ml-mobileclip) and the data generation code (https://github.com/apple/ml-mobileclip-dr). The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.","short_abstract":"Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures...","url_abs":"https://arxiv.org/abs/2508.20691","url_pdf":"https://arxiv.org/pdf/2508.20691v1","authors":"[\"Fartash Faghri\",\"Pavan Kumar Anasosalu Vasu\",\"Cem Koc\",\"Vaishaal Shankar\",\"Alexander Toshev\",\"Oncel Tuzel\",\"Hadi Pouransari\"]","published":"2025-08-28T11:50:22Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[]","has_code":false,"code_links":[{"ID":610344,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877140,"paper_url":"https://arxiv.org/abs/2508.20691","paper_title":"MobileCLIP2: Improving Multi-Modal Reinforced Training","repo_url":"https://github.com/apple/ml-mobileclip","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":610345,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877140,"paper_url":"https://arxiv.org/abs/2508.20691","paper_title":"MobileCLIP2: Improving Multi-Modal Reinforced Training","repo_url":"https://github.com/apple/ml-mobileclip-dr","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}