{"ID":2879348,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16188","arxiv_id":"2508.16188","title":"Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation","abstract":"We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.","short_abstract":"We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine...","url_abs":"https://arxiv.org/abs/2508.16188","url_pdf":"https://arxiv.org/pdf/2508.16188v2","authors":"[\"Weiting Tan\",\"Jiachen Lian\",\"Hirofumi Inaguma\",\"Paden Tomasello\",\"Philipp Koehn\",\"Xutai Ma\"]","published":"2025-08-22T08:08:45Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CV\",\"cs.MM\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}