{"ID":2855073,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13308","arxiv_id":"2510.13308","title":"Towards Multimodal Query-Based Spatial Audio Source Extraction","abstract":"Query-based audio source extraction seeks to recover a target source from a mixture conditioned on a query. Existing approaches are largely confined to single-channel audio, leaving the spatial information in multi-channel recordings underexploited. We introduce a query-based spatial audio source extraction framework for recovering dry target signals from first-order ambisonics (FOA) mixtures. Our method accepts either an audio prompt or a text prompt as the conditioning input, enabling flexible end-to-end extraction. The core of the proposed model is a tri-axial Transformer that jointly models temporal, frequency, and spatial channel dependencies. The model uses contrastive language-audio pretraining (CLAP) embeddings to enable unified audio-text conditioning via feature-wise linear modulation (FiLM). To eliminate costly annotations and improve generalization, we propose a label-free data pipeline that dynamically generates spatial mixtures and corresponding targets for training. Experimental results show high separation quality, demonstrating the efficacy of multimodal conditioning and tri-axial modeling. This work establishes a new paradigm for high-fidelity spatial audio separation in immersive applications.","short_abstract":"Query-based audio source extraction seeks to recover a target source from a mixture conditioned on a query. Existing approaches are largely confined to single-channel audio, leaving the spatial information in multi-channel recordings underexploited. We introduce a query-based spatial audio source extraction framework f...","url_abs":"https://arxiv.org/abs/2510.13308","url_pdf":"https://arxiv.org/pdf/2510.13308v1","authors":"[\"Chenxin Yu\",\"Hao Ma\",\"Xu Li\",\"Xiao-Lei Zhang\",\"Mingjie Shao\",\"Chi Zhang\",\"Xuelong Li\"]","published":"2025-10-15T08:55:23Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false}