{"ID":2889720,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20987","arxiv_id":"2507.20987","title":"JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1","abstract":"Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.","short_abstract":"Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. 
Current approaches lack comprehensive evaluation frameworks that assess both visual and audio...","url_abs":"https://arxiv.org/abs/2507.20987","url_pdf":"https://arxiv.org/pdf/2507.20987v2","authors":"[\"Xinhan Di\",\"Kristin Qi\",\"Pengqian Yu\"]","published":"2025-07-28T16:47:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Diffusion Model\"]","has_code":false,"code_links":[{"ID":611681,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2889720,"paper_url":"https://arxiv.org/abs/2507.20987","paper_title":"JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1","repo_url":"https://github.com/deepreasonings/WholeBodyBenchmark","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}