{"ID":2880582,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.13624","arxiv_id":"2508.13624","title":"Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement","abstract":"Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.","short_abstract":"Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overco...","url_abs":"https://arxiv.org/abs/2508.13624","url_pdf":"https://arxiv.org/pdf/2508.13624v2","authors":"[\"Rong Chao\",\"Wenze Ren\",\"You-Jin Li\",\"Kuo-Hsuan Hung\",\"Sung-Feng Huang\",\"Szu-Wei Fu\",\"Wen-Huang Cheng\",\"Yu Tsao\"]","published":"2025-08-19T08:34:57Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}