{"ID":2851386,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20994","arxiv_id":"2510.20994","title":"VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models","abstract":"Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.","short_abstract":"Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-super...","url_abs":"https://arxiv.org/abs/2510.20994","url_pdf":"https://arxiv.org/pdf/2510.20994v1","authors":"[\"Jesimon Barreto\",\"Carlos Caetano\",\"André Araujo\",\"William Robson Schwartz\"]","published":"2025-10-23T20:44:28Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607897,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2851386,"paper_url":"https://arxiv.org/abs/2510.20994","paper_title":"VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models","repo_url":"https://github.com/jesimonbarreto/VESSA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
