{"ID":2875422,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.02278","arxiv_id":"2509.02278","title":"Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation","abstract":"Singing-driven 3D head animation is a challenging yet promising task with applications in virtual avatars, entertainment, and education. Unlike speech, singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics, requiring the synthesis of fine-grained, temporally coherent facial motion. Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results, which are insufficient for singing animation. To address this, we propose Think2Sing, a diffusion-based framework that leverages pretrained large language models to generate semantically coherent and temporally consistent 3D head animations, conditioned on both lyrics and acoustics. A key innovation is the introduction of motion subtitles, an auxiliary semantic representation derived through a novel Singing Chain-of-Thought reasoning process combined with acoustic-guided retrieval. These subtitles contain precise timestamps and region-specific motion descriptions, serving as interpretable motion priors. We frame the task as a motion intensity prediction problem, enabling finer control over facial regions and improving the modeling of expressive motion. To support this, we create a multimodal singing dataset with synchronized video, acoustic descriptors, and motion subtitles, enabling diverse and expressive motion learning. Extensive experiments show that Think2Sing outperforms state-of-the-art methods in realism, expressiveness, and emotional fidelity, while also offering flexible, user-controllable animation editing.","short_abstract":"Singing-driven 3D head animation is a challenging yet promising task with applications in virtual avatars, entertainment, and education. Unlike speech, singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics, requiring the synthesis of fine-grained, temporally coherent facial motion. Existi...","url_abs":"https://arxiv.org/abs/2509.02278","url_pdf":"https://arxiv.org/pdf/2509.02278v1","authors":"[\"Zikai Huang\",\"Yihan Zhou\",\"Xuemiao Xu\",\"Cheng Xu\",\"Xiaofen Xing\",\"Jing Qin\",\"Shengfeng He\"]","published":"2025-09-02T12:59:27Z","proceeding":"cs.GR","tasks":"[\"cs.GR\",\"cs.AI\",\"cs.MM\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false}
