{"ID":2870997,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.12052","arxiv_id":"2509.12052","title":"FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling","abstract":"Current talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. An important reason is that denoising trajectory variation induced by stochastic initialization leaves residual inter-frame inconsistencies, which manifest as short-term, abrupt visual fluctuations between adjacent frames. To further verify this, we conduct a controlled study by fixing the input while varying only the random seed. The results show markedly different flicker patterns across samplings, with a mean inter-seed Pearson correlation of only r = 0.15. This motivates us to explore autoregressive generation, which models frames sequentially and provides a more direct prior for temporal continuity. Based on this, we propose FluentAvatar, a two-stage autoregressive framework built on phoneme representations. First, Facial Keyframe Generation produces phoneme-aligned keyframes under a Phoneme-Frame Causal Attention Mask, and Inter-frame Interpolation synthesizes transition frames via a timestamp-aware adaptive strategy built upon selective state space modeling. Moreover, we introduce BG-Flicker, a background-isolated metric for talking-head videos that enables more reliable evaluation of inter-frame flicker. Experiments on CMLR and HDTF demonstrate that FluentAvatar achieves strong performance in visual fidelity, lip synchronization, and temporal stability, attaining the best FVD on both datasets and BG-Flicker results close to ground truth. The code, the model, and the interface will be released to facilitate further research.","short_abstract":"Current talking-head generation has gradually shifted from GAN-based methods to diffusion-based paradigms, achieving remarkable progress in visual fidelity and temporal consistency. However, inter-frame flicker remains prevalent in existing diffusion-based methods. An important reason is that denoising trajectory varia...","url_abs":"https://arxiv.org/abs/2509.12052","url_pdf":"https://arxiv.org/pdf/2509.12052v3","authors":"[\"Yuchen Deng\",\"Xiuyang Wu\",\"Hai-Tao Zheng\",\"Suiyang Zhang\",\"Yi He\",\"Yuxing Han\"]","published":"2025-09-15T15:34:02Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Generative Adversarial Network\"]","has_code":false}