{"ID":2864951,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22740","arxiv_id":"2509.22740","title":"Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation","abstract":"Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.","short_abstract":"Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only traini...","url_abs":"https://arxiv.org/abs/2509.22740","url_pdf":"https://arxiv.org/pdf/2509.22740v2","authors":"[\"Jinbae Seo\",\"Hyeongjun Kwon\",\"Kwonyoung Kim\",\"Jiyoung Lee\",\"Kwanghoon Sohn\"]","published":"2025-09-26T02:31:17Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.MM\",\"cs.SD\"]","methods":"[]","has_code":false}
