{"ID":2856318,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11330","arxiv_id":"2510.11330","title":"Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap","abstract":"Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained at the output embedding from the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance https://github.com/DevKiHyun/Diffusion-Link","short_abstract":"Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into...","url_abs":"https://arxiv.org/abs/2510.11330","url_pdf":"https://arxiv.org/pdf/2510.11330v1","authors":"[\"KiHyun Nam\",\"Jongmin Choi\",\"Hyeongkeun Lee\",\"Jungwoo Heo\",\"Joon Son Chung\"]","published":"2025-10-13T12:25:33Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.CL\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608335,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2856318,"paper_url":"https://arxiv.org/abs/2510.11330","paper_title":"Diffusion-Link: Diffusion Probabilistic Model for Bridging the Audio-Text Modality Gap","repo_url":"https://github.com/DevKiHyun/Diffusion-Link","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}