{"ID":2864483,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24109","arxiv_id":"2509.24109","title":"SVAC: Scaling Is All You Need For Referring Video Object Segmentation","abstract":"Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.","short_abstract":"Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. \nWhile recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insuffici...","url_abs":"https://arxiv.org/abs/2509.24109","url_pdf":"https://arxiv.org/pdf/2509.24109v1","authors":"[\"Li Zhang\",\"Haoxiang Gao\",\"Zhihao Zhang\",\"Luoxiao Huang\",\"Tao Zhang\"]","published":"2025-09-28T23:02:09Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609155,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864483,"paper_url":"https://arxiv.org/abs/2509.24109","paper_title":"SVAC: Scaling Is All You Need For Referring Video Object Segmentation","repo_url":"https://github.com/lizhang1998/SVAC","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
