{"ID":2867108,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18928","arxiv_id":"2509.18928","title":"Direct Preference Optimization for Speech Autoregressive Diffusion Models","abstract":"Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoiding the technical complexities associated with discrete speech tokenization. As a relatively new paradigm, research on reinforcement learning (RL)-based fine-tuning of speech ARDMs remains limited. In this paper, we propose Autoregressive Diffusion-Direct Preference Optimization (ARDM-DPO) to advance this research. By fine-tuning the recently proposed zero-shot text-to-speech model DiTAR with DPO, we achieve significant improvements in terms of speech expressiveness and robustness for long texts.","short_abstract":"Autoregressive diffusion models (ARDMs) have recently been applied to speech generation, achieving state-of-the-art (SOTA) performance in zero-shot text-to-speech. By autoregressively generating continuous speech tokens with next-token diffusion, these models offer a promising alternative to next-token prediction, avoi...","url_abs":"https://arxiv.org/abs/2509.18928","url_pdf":"https://arxiv.org/pdf/2509.18928v1","authors":"[\"Zhijun Liu\",\"Dongya Jia\",\"Xiaoqiang Wang\",\"Chenpeng Du\",\"Shuai Wang\",\"Zhuo Chen\",\"Haizhou Li\"]","published":"2025-09-23T12:47:53Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}