{"ID":2829038,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13495","arxiv_id":"2512.13495","title":"Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation","abstract":"We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation. We construct Soul-1M, containing 1 million finely annotated samples with a precise automated annotation pipeline (covering portrait, upper-body, full-body, and multi-person scenes) to mitigate data scarcity, and we carefully curate Soul-Bench for comprehensive and fair evaluation of audio-/text-guided animation methods. The model is built on the Wan2.2-5B backbone, integrating audio-injection layers and multiple training strategies together with threshold-aware codebook replacement to ensure long-term generation consistency. Meanwhile, step/CFG distillation and a lightweight VAE are used to optimize inference efficiency, achieving an 11.4$\\times$ speedup with negligible quality loss. Extensive experiments show that Soul significantly outperforms current leading open-source and commercial models on video quality, video-text alignment, identity preservation, and lip-synchronization accuracy, demonstrating broad applicability in real-world scenarios such as virtual anchors and film production. \nProject page at https://zhangzjn.github.io/projects/Soul/","short_abstract":"We propose a multimodal-driven framework for high-fidelity long-term digital human animation termed $\\textbf{Soul}$, which generates semantically coherent videos from a single-frame portrait image, text prompts, and audio, achieving precise lip synchronization, vivid facial expressions, and robust identity preservation...","url_abs":"https://arxiv.org/abs/2512.13495","url_pdf":"https://arxiv.org/pdf/2512.13495v1","authors":"[\"Jiangning Zhang\",\"Junwei Zhu\",\"Zhenye Gan\",\"Donghao Luo\",\"Chuming Lin\",\"Feifan Xu\",\"Xu Peng\",\"Jianlong Hu\",\"Yuansen Liu\",\"Yijia Hong\",\"Weijian Cao\",\"Han Feng\",\"Xu Chen\",\"Chencan Fu\",\"Keke He\",\"Xiaobin Hu\",\"Chengjie Wang\"]","published":"2025-12-15T16:25:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Variational Autoencoder\"]","has_code":false}
