{"ID":2898238,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.03256","arxiv_id":"2507.03256","title":"MoDA: Multi-modal Diffusion Architecture for Talking Head Generation","abstract":"Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy is employed to progressively integrate different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications. Project Page: https://lixinyyang.github.io/MoDA.github.io/","short_abstract":"Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based method...","url_abs":"https://arxiv.org/abs/2507.03256","url_pdf":"https://arxiv.org/pdf/2507.03256v3","authors":"[\"Xinyang Li\",\"Gen Li\",\"Zhihui Lin\",\"Yichen Qian\",\"GongXin Yao\",\"Weinan Jia\",\"Aowen Wang\",\"Weihua Chen\",\"Fan Wang\"]","published":"2025-07-04T02:25:10Z","proceeding":"cs.GR","tasks":"[\"cs.GR\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Variational Autoencoder\"]","has_code":false}