{"ID":2828958,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13313","arxiv_id":"2512.13313","title":"KlingAvatar 2.0 Technical Report","abstract":"Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.","short_abstract":"Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, w...","url_abs":"https://arxiv.org/abs/2512.13313","url_pdf":"https://arxiv.org/pdf/2512.13313v1","authors":"[\"Kling Team\",\"Jialu Chen\",\"Yikang Ding\",\"Zhixue Fang\",\"Kun Gai\",\"Yuan Gao\",\"Kang He\",\"Jingyun Hua\",\"Boyuan Jiang\",\"Mingming Lao\",\"Xiaohan Li\",\"Hui Liu\",\"Jiwen Liu\",\"Xiaoqiang Liu\",\"Yuan Liu\",\"Shun Lu\",\"Yongsen Mao\",\"Yingchao Shao\",\"Huafeng Shi\",\"Xiaoyu Shi\",\"Peiqin Sun\",\"Songlin Tang\",\"Pengfei Wan\",\"Chao Wang\",\"Xuebo Wang\",\"Haoxian Zhang\",\"Yuanxing Zhang\",\"Yan Zhou\"]","published":"2025-12-15T13:30:51Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}