{"ID":2856542,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11835","arxiv_id":"2510.11835","title":"Data or Language Supervision: What Makes CLIP Better than DINO?","abstract":"CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the same architecture, dataset, and training configuration -- achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.","short_abstract":"CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP's language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings -- using the s...","url_abs":"https://arxiv.org/abs/2510.11835","url_pdf":"https://arxiv.org/pdf/2510.11835v1","authors":"[\"Yiming Liu\",\"Yuhui Zhang\",\"Dhruba Ghosh\",\"Ludwig Schmidt\",\"Serena Yeung-Levy\"]","published":"2025-10-13T18:34:58Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\",\"cs.LG\",\"cs.MM\"]","methods":"[\"Language Model\"]","has_code":false}
