{"ID":2857006,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10051","arxiv_id":"2510.10051","title":"Complementary and Contrastive Learning for Audio-Visual Segmentation","abstract":"Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual interactions through basic operations like padding and multiplications but are restricted by CNNs' limited local receptive field. More recently, Transformer-based methods treat auditory cues as queries, utilizing attention mechanisms to enhance audio-visual cooperation within frames. Nevertheless, they typically struggle to extract multimodal coefficients and temporal dynamics adequately. To overcome these limitations, we present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information and capturing spatial-temporal context comprehensively. Our CCFormer initiates with the Early Integration Module (EIM) that employs a parallel bilateral architecture, merging multi-scale visual features with audio data to boost cross-modal complementarity. To extract the intra-frame spatial features and facilitate the perception of temporal coherence, we introduce the Multi-query Transformer Module (MTM), which dynamically endows audio queries with learning capabilities and models the frame and video-level relations simultaneously. Furthermore, we propose the Bi-modal Contrastive Learning (BCL) to promote the alignment across both modalities in the unified feature space. Through the effective combination of those designs, our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets. Our source code and model weights will be made publicly available at https://github.com/SitongGong/CCFormer","short_abstract":"Audio-Visual Segmentation (AVS) aims to generate pixel-wise segmentation maps that correlate with the auditory signals of objects. This field has seen significant progress with numerous CNN and Transformer-based methods enhancing the segmentation accuracy and robustness. Traditional CNN approaches manage audio-visual i...","url_abs":"https://arxiv.org/abs/2510.10051","url_pdf":"https://arxiv.org/pdf/2510.10051v1","authors":"[\"Sitong Gong\",\"Yunzhi Zhuge\",\"Lu Zhang\",\"Pingping Zhang\",\"Huchuan Lu\"]","published":"2025-10-11T06:36:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false,"code_links":[{"ID":608401,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2857006,"paper_url":"https://arxiv.org/abs/2510.10051","paper_title":"Complementary and Contrastive Learning for Audio-Visual Segmentation","repo_url":"https://github.com/SitongGong/CCFormer","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}