{"ID":2865116,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21990","arxiv_id":"2509.21990","title":"WAVE: Learning Unified \u0026 Versatile Audio-Visual Embeddings with Multimodal LLM","abstract":"While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified \u0026 versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.","short_abstract":"While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified \u0026 versatile audio-visual embeddings), the first LLM-based e...","url_abs":"https://arxiv.org/abs/2509.21990","url_pdf":"https://arxiv.org/pdf/2509.21990v2","authors":"[\"Changli Tang\",\"Qinfan Xiao\",\"Ke Mei\",\"Tianyi Wang\",\"Fengyun Rao\",\"Chao Zhang\"]","published":"2025-09-26T07:13:37Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.SD\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609241,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2865116,"paper_url":"https://arxiv.org/abs/2509.21990","paper_title":"WAVE: Learning Unified \u0026 Versatile Audio-Visual Embeddings with Multimodal LLM","repo_url":"https://github.com/TCL606/WAVE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
