{"ID":2857521,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09274","arxiv_id":"2510.09274","title":"MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding","abstract":"Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \\texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg","short_abstract":"Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyfr...","url_abs":"https://arxiv.org/abs/2510.09274","url_pdf":"https://arxiv.org/pdf/2510.09274v1","authors":"[\"Ming Dai\",\"Sen Yang\",\"Boqiang Duan\",\"Wankou Yang\",\"Jingdong Wang\"]","published":"2025-10-10T11:18:21Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608458,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2857521,"paper_url":"https://arxiv.org/abs/2510.09274","paper_title":"MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding","repo_url":"https://github.com/Dmmm1997/MomentSeg","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}