{"ID":2890492,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19225","arxiv_id":"2507.19225","title":"Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation","abstract":"Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed-driven speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both talking face animation and its corresponding speeches. Accordingly, we propose a novel framework, Face2VoiceSync, with several novel contributions: 1) Voice-Face Alignment, ensuring generated voices match facial appearance; 2) Diversity \u0026 Manipulation, enabling generated voice control over paralinguistic features space; 3) Efficient Training, using a lightweight VAE to bridge visual and audio large-pretrained models, with significantly fewer trainable parameters than existing methods; 4) New Evaluation Metric, fairly assessing the diversity and identity consistency. Experiments show Face2VoiceSync achieves both visual and audio state-of-the-art performances on a single 40GB GPU.","short_abstract":"Recent studies in speech-driven talking face generation achieve promising results, but their reliance on fixed-driven speech limits further applications (e.g., face-voice mismatch). Thus, we extend the task to a more challenging setting: given a face image and text to speak, generating both talking face animation and i...","url_abs":"https://arxiv.org/abs/2507.19225","url_pdf":"https://arxiv.org/pdf/2507.19225v1","authors":"[\"Fang Kang\",\"Yin Cao\",\"Haoyu Chen\"]","published":"2025-07-25T12:49:06Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CV\",\"cs.MM\",\"eess.AS\"]","methods":"[\"Variational Autoencoder\"]","has_code":false}