{"ID":2839206,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.16618","arxiv_id":"2511.16618","title":"SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking","abstract":"Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \\textbf{SAM2} for \\textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\\mathcal{J}$\\\u0026$\\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\\mathcal{J}$\\\u0026$\\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.","short_abstract":"Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face ch...","url_abs":"https://arxiv.org/abs/2511.16618","url_pdf":"https://arxiv.org/pdf/2511.16618v1","authors":"[\"Haofeng Liu\",\"Ziyue Wang\",\"Sudhanshu Mishra\",\"Mingqi Gao\",\"Guanyi Qin\",\"Chang Han Low\",\"Alex Y. W. Kong\",\"Yueming Jin\"]","published":"2025-11-20T18:18:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"eess.IV\",\"q-bio.TO\"]","methods":"[]","has_code":false}