{"ID":2885775,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04418","arxiv_id":"2508.04418","title":"Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation","abstract":"Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.","short_abstract":"Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. 
Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lac...","url_abs":"https://arxiv.org/abs/2508.04418","url_pdf":"https://arxiv.org/pdf/2508.04418v1","authors":"[\"Jinxing Zhou\",\"Yanghao Zhou\",\"Mingfei Han\",\"Tong Wang\",\"Xiaojun Chang\",\"Hisham Cholakkal\",\"Rao Muhammad Anwer\"]","published":"2025-08-06T13:05:09Z","proceeding":"cs.MM","tasks":"[\"cs.MM\",\"cs.CV\",\"cs.MA\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611236,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2885775,"paper_url":"https://arxiv.org/abs/2508.04418","paper_title":"Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation","repo_url":"https://github.com/jasongief/TGS-Agent","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}