{"ID":2853385,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.16437","arxiv_id":"2510.16437","title":"Audio-Visual Speech Enhancement for Spatial Audio - Spatial-VisualVoice and the MAVE Database","abstract":"Audio-visual speech enhancement (AVSE) has been found to be particularly useful at low signal-to-noise ratios (SNRs) due to the immunity of the visual features to acoustic noise. However, a significant gap exists in AVSE methods tailored to enhance spatial audio under low-SNR conditions. The latter is of growing interest for augmented reality applications. To address this gap, we present a multi-channel AVSE framework based on VisualVoice that leverages spatial cues from microphone arrays and visual information for enhancing the target speaker in noisy environments. We also introduce MAVe, a novel database containing multi-channel audio-visual signals in controlled, reproducible room conditions across a wide range of SNR levels. Experiments demonstrate that the proposed method consistently achieves significant gains in SI-SDR, STOI, and PESQ, particularly at low SNRs. Binaural signal analysis further confirms the preservation of spatial cues and intelligibility.","short_abstract":"Audio-visual speech enhancement (AVSE) has been found to be particularly useful at low signal-to-noise ratios (SNRs) due to the immunity of the visual features to acoustic noise. However, a significant gap exists in AVSE methods tailored to enhance spatial audio under low-SNR conditions. The latter is of growing interes...","url_abs":"https://arxiv.org/abs/2510.16437","url_pdf":"https://arxiv.org/pdf/2510.16437v1","authors":"[\"Danielle Yaffe\",\"Ferdinand Campe\",\"Prachi Sharma\",\"Dorothea Kolossa\",\"Boaz Rafaely\"]","published":"2025-10-18T10:20:12Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[]","has_code":false}
