{"ID":2886368,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03457","arxiv_id":"2508.03457","title":"READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation","abstract":"The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-transformer-based talking head generation framework. Our approach first learns a spatiotemporal highly compressed video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-time generation.","short_abstract":"The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, a real-time diffusion-tr...","url_abs":"https://arxiv.org/abs/2508.03457","url_pdf":"https://arxiv.org/pdf/2508.03457v3","authors":"[\"Haotian Wang\",\"Yuzhe Weng\",\"Jun Du\",\"Haoran Xu\",\"Xiaoyan Wu\",\"Shan He\",\"Bing Yin\",\"Cong Liu\",\"Jianqing Gao\",\"Qingfeng Liu\"]","published":"2025-08-05T13:57:03Z","proceeding":"cs.GR","tasks":"[\"cs.GR\",\"cs.CV\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Variational Autoencoder\"]","has_code":false}
