{"ID":2838049,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18519","arxiv_id":"2511.18519","title":"CHIPS: Efficient CLIP Adaptation via Curvature-aware Hybrid Influence-based Data Selection","abstract":"Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute for large-scale datasets in CPT? We introduce CHIPS (Curvature-aware Hybrid Influence in Projection Subspace), which assigns each image-text pair a utility score that integrates three complementary factors aligned with three goals: faithfulness via a curvature-aware and Newton-style alignment computed in CLIP's end-point subspace; scalability via an InfoNCE-aware curvature estimator with Johnson-Lindenstrauss (JL) sketching; and retention via a selection-aware relevance weight combined with learnability to balance target adaptation against general-domain preservation. We justify this design theoretically by proving a lower-bound guarantee on the proxy's correlation with full-parameter alignment and by characterizing the bias-variance trade-offs introduced by curvature mixing and JL sketching. We evaluate CHIPS empirically across various settings: 1) CHIPS attains state-of-the-art performance among selection baselines on 17 medical benchmarks, matches full-dataset CPT with 30% of the data, and outperforms half-dataset CPT using only 10%; 2) on 31 general-domain benchmarks, CHIPS yields the least performance drop under all retention ratios.","short_abstract":"Adapting CLIP to vertical domains is typically approached by novel fine-tuning strategies or by continual pre-training (CPT) on large domain-specific datasets. Yet, data itself remains an underexplored factor in this process. We revisit this task from a data-centric perspective: Can effective data selection substitute...","url_abs":"https://arxiv.org/abs/2511.18519","url_pdf":"https://arxiv.org/pdf/2511.18519v2","authors":"[\"Xinlin Zhuang\",\"Yichen Li\",\"Xiwei Liu\",\"Haolin Yang\",\"Yifan Lu\",\"Ziyun Zou\",\"Yulong Li\",\"Huifa Li\",\"Dongliang Chen\",\"Qinglei Wang\",\"Weiyang Liu\",\"Ying Qian\",\"Jiangming Shi\",\"Imran Razzak\"]","published":"2025-11-23T16:25:42Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}
