{"ID":2880580,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.13618","arxiv_id":"2508.13618","title":"TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis","abstract":"Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced across key demographic and linguistic axes. Our experiments demonstrate that a model trained on TalkVid outperforms counterparts trained on previous datasets, exhibiting superior cross-dataset generalization. Crucially, our analysis on TalkVid-Bench reveals performance disparities across subgroups that are obscured by traditional aggregate metrics, underscoring its necessity for future research. Code and data can be found in https://github.com/FreedomIntelligence/TalkVid","short_abstract":"Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups.\nWe argue that this generalization gap is a direct symptom of limitations in ex...","url_abs":"https://arxiv.org/abs/2508.13618","url_pdf":"https://arxiv.org/pdf/2508.13618v1","authors":"[\"Shunian Chen\",\"Hejin Huang\",\"Yexin Liu\",\"Zihan Ye\",\"Pengcheng Chen\",\"Chenghao Zhu\",\"Michael Guan\",\"Rongsheng Wang\",\"Junying Chen\",\"Guanbin Li\",\"Ser-Nam Lim\",\"Harry Yang\",\"Benyou Wang\"]","published":"2025-08-19T08:31:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":610688,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2880580,"paper_url":"https://arxiv.org/abs/2508.13618","paper_title":"TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis","repo_url":"https://github.com/FreedomIntelligence/TalkVid","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}