{"ID":2860068,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.05295","arxiv_id":"2510.05295","title":"AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement","abstract":"In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement), a progressive bimodal framework tailored for audio-visual speech enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing a U-Net-based 1D convolutional encoder for audio and a Swin Transformer V2 for efficient and expressive visual feature extraction. Central to the architecture is a novel bidirectional cross-attention mechanism, which facilitates deep contextual fusion between modalities, enabling rich and complementary representation learning. To capture temporal dependencies within the fused embeddings, a stack of lightweight Squeezeformer blocks combining convolutional and attention modules is introduced. The enhanced embeddings are then decoded via a U-Net-style decoder for direct waveform reconstruction, ensuring perceptually consistent and intelligible speech output. Experimental evaluations demonstrate the effectiveness of AUREXA-SE, achieving significant performance improvements over noisy baselines, with STOI of 0.516, PESQ of 1.323, and SI-SDR of -4.322 dB. The source code of AUREXA-SE is available at https://github.com/mtanveer1/AVSEC-4-Challenge-2025.","short_abstract":"In this paper, we propose AUREXA-SE (Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement), a progressive bimodal framework tailored for audio-visual speech enhancement (AVSE). AUREXA-SE jointly leverages raw audio waveforms and visual cues by employing...","url_abs":"https://arxiv.org/abs/2510.05295","url_pdf":"https://arxiv.org/pdf/2510.05295v1","authors":"[\"M. Sajid\",\"Deepanshu Gupta\",\"Yash Modi\",\"Sanskriti Jain\",\"Harshith Jai Surya Ganji\",\"A. Rahaman\",\"Harshvardhan Choudhary\",\"Nasir Saleem\",\"Amir Hussain\",\"M. Tanveer\"]","published":"2025-10-06T19:05:35Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.MM\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":608696,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2860068,"paper_url":"https://arxiv.org/abs/2510.05295","paper_title":"AUREXA-SE: Audio-Visual Unified Representation Exchange Architecture with Cross-Attention and Squeezeformer for Speech Enhancement","repo_url":"https://github.com/mtanveer1/AVSEC-4-Challenge-2025","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}