{"ID":2921561,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T03:09:48.883664427Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00987","arxiv_id":"2606.00987","title":"An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation","abstract":"Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.","short_abstract":"Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment langu...","url_abs":"https://arxiv.org/abs/2606.00987","url_pdf":"https://arxiv.org/pdf/2606.00987v1","authors":"[\"Bingyu Li\",\"Da Zhang\",\"Tao Huo\",\"Zhiyuan Zhao\",\"Junyu Gao\",\"Xuelong Li\"]","published":"2026-05-31T04:01:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}
