{"ID":2885435,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.05851","arxiv_id":"2508.05851","title":"Temporal Cluster Assignment for Efficient Real-Time Video Segmentation","abstract":"Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency. Nevertheless, they fail to exploit temporal redundancy, missing a key opportunity to further optimize video segmentation performance. We introduce Temporal Cluster Assignment (TCA), a lightweight and effective, fine-tuning-free strategy that enhances token clustering by leveraging temporal coherence across frames. Instead of indiscriminately dropping redundant tokens, TCA refines token clusters using temporal correlations, thereby retaining fine-grained details while significantly reducing computation. Extensive evaluations on YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and a private surgical video dataset show that TCA consistently boosts the accuracy-speed trade-off of existing clustering-based methods. Our results demonstrate that TCA generalizes competently across both natural and domain-specific videos.","short_abstract":"Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its win...","url_abs":"https://arxiv.org/abs/2508.05851","url_pdf":"https://arxiv.org/pdf/2508.05851v1","authors":"[\"Ka-Wai Yung\",\"Felix J. S. Bragman\",\"Jialang Xu\",\"Imanol Luengo\",\"Danail Stoyanov\",\"Evangelos B. Mazomenos\"]","published":"2025-08-07T20:52:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}