{"ID":2831947,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.06689","arxiv_id":"2512.06689","title":"Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation","abstract":"Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS within multi-stage architectures, these approaches typically involve complex, parameter-heavy models and rely on supervised training, limiting scalability and generalization. In this work, we propose UniVoiceLite, a lightweight and unsupervised audio-visual framework that unifies SE and SS within a single model. UniVoiceLite leverages lip motion and facial identity cues to guide speech extraction and employs Wasserstein distance regularization to stabilize the latent space without requiring paired noisy-clean data. Experimental results demonstrate that UniVoiceLite achieves strong performance in both noisy and multi-speaker scenarios, combining efficiency with robust generalization. The source code is available at https://github.com/jisoo-o/UniVoiceLite.","short_abstract":"Speech Enhancement (SE) and Speech Separation (SS) have traditionally been treated as distinct tasks in speech processing. However, real-world audio often involves both background noise and overlapping speakers, motivating the need for a unified solution. While recent approaches have attempted to integrate SE and SS wi...","url_abs":"https://arxiv.org/abs/2512.06689","url_pdf":"https://arxiv.org/pdf/2512.06689v1","authors":"[\"Jisoo Park\",\"Seonghak Lee\",\"Guisik Kim\",\"Taewoo Kim\",\"Junseok Kwon\"]","published":"2025-12-07T06:48:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"eess.AS\"]","methods":"[]","has_code":false,"code_links":[{"ID":606178,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831947,"paper_url":"https://arxiv.org/abs/2512.06689","paper_title":"Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation","repo_url":"https://github.com/jisoo-o/UniVoiceLite","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}