{"ID":2892186,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15480","arxiv_id":"2507.15480","title":"One Last Attention for Your Vision-Language Model","abstract":"Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, \\emph{\\ie} rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective \\textbf{R}ational \\textbf{Ada}ptation ({RAda}) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating, or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings. Code is available at \\href{https://github.com/khufia/RAda/tree/main}{github.com/khufia/RAda}.","short_abstract":"Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. 
Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused repr...","url_abs":"https://arxiv.org/abs/2507.15480","url_pdf":"https://arxiv.org/pdf/2507.15480v2","authors":"[\"Liang Chen\",\"Ghazi Shazan Ahmad\",\"Tianjun Yao\",\"Lingqiao Liu\",\"Zhiqiang Shen\"]","published":"2025-07-21T10:35:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611964,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2892186,"paper_url":"https://arxiv.org/abs/2507.15480","paper_title":"One Last Attention for Your Vision-Language Model","repo_url":"https://github.com/khufia/RAda","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
