{"ID":2828424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14230","arxiv_id":"2512.14230","title":"Understanding the Gain from Data Filtering in Multimodal Contrastive Learning","abstract":"The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution, leveraging a pre-trained model to compute quality scores. To explain the empirical success of teacher-based filtering, we characterize the performance of filtered contrastive learning under the standard bimodal data generation model. Denoting $η\\in(0,1]$ as the fraction of data with correctly matched modalities among $n$ paired samples, we utilize a linear contrastive learning setup to show a provable benefit of data filtering: $(i)$ the error without filtering is upper and lower bounded by $\\frac{1}{η\\sqrt{n}}$, and $(ii)$ the error with teacher-based filtering is upper bounded by $\\frac{1}{\\sqrt{ηn}}$ in the large $η$ regime, and by $\\frac{1}{\\sqrt{n}}$ in the small $η$ regime.","short_abstract":"The success of modern multimodal representation learning relies on internet-scale datasets. Due to the low quality of a large fraction of raw web data, data curation has become a critical step in the training pipeline. Filtering using a trained model (i.e., teacher-based filtering) has emerged as a successful solution,...","url_abs":"https://arxiv.org/abs/2512.14230","url_pdf":"https://arxiv.org/pdf/2512.14230v1","authors":"[\"Divyansh Pareek\",\"Sewoong Oh\",\"Simon S. Du\"]","published":"2025-12-16T09:28:38Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"stat.ML\"]","methods":"[]","has_code":false}