{"ID":3049995,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T14:23:19.411228982Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04939","arxiv_id":"2606.04939","title":"UAT: Unified Audio-Text Diffusion for Audio Generation, Editing, and Captioning","abstract":"Audio generation and audio-to-text understanding remain largely separate, with diffusion models dominating high-fidelity synthesis and autoregressive (AR) language models driving captioning and semantic prediction. Existing unified approaches typically rely on either heterogeneous modules or AR-centric modeling, which can hinder joint optimization and limit acoustic fidelity. We present UAT, to our knowledge, the first diffusion-centric framework that supports unified audio generation, editing, and captioning. UAT couples continuous latent diffusion for audio with masked discrete diffusion for text, enabling bidirectional audio-text modeling within a shared dual-stream backbone. Experiments show that UAT preserves strong audio generation and editing capabilities while achieving competitive captioning performance, demonstrating a favorable balance between acoustic synthesis and semantic prediction. Demo samples are available at https://UAT-demo.github.io.","short_abstract":"Audio generation and audio-to-text understanding remain largely separate, with diffusion models dominating high-fidelity synthesis and autoregressive (AR) language models driving captioning and semantic prediction. Existing unified approaches typically rely on either heterogeneous modules or AR-centric modeling, which...","url_abs":"https://arxiv.org/abs/2606.04939","url_pdf":"https://arxiv.org/pdf/2606.04939v1","authors":"[\"Hui Wang\",\"Yifan Yang\",\"Zeyue Tian\",\"Yuhang Jia\",\"Jinghua Zhao\",\"Long Zhou\",\"Bing Han\",\"Cheng Liu\",\"Jiaming Zhou\",\"Geng Tu\",\"Yong Qin\"]","published":"2026-06-03T14:29:52Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false}