{"ID":2869931,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14097","arxiv_id":"2509.14097","title":"Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing","abstract":"Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on the LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.","short_abstract":"Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal a...","url_abs":"https://arxiv.org/abs/2509.14097","url_pdf":"https://arxiv.org/pdf/2509.14097v1","authors":"[\"Yaru Chen\",\"Ruohao Guo\",\"Liting Gao\",\"Yang Xiang\",\"Qingyu Luo\",\"Zhenbo Li\",\"Wenwu Wang\"]","published":"2025-09-17T15:38:05Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\"]","methods":"[]","has_code":false}