{"ID":2856244,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.12834","arxiv_id":"2510.12834","title":"Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction","abstract":"Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.","short_abstract":"Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from tex...","url_abs":"https://arxiv.org/abs/2510.12834","url_pdf":"https://arxiv.org/pdf/2510.12834v4","authors":"[\"Téo Guichoux\",\"Théodor Lemerle\",\"Shivam Mehta\",\"Jonas Beskow\",\"Gustav Eje Henter\",\"Laure Soulier\",\"Catherine Pelachaud\",\"Nicolas Obin\"]","published":"2025-10-13T09:51:26Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"eess.AS\"]","methods":"[]","has_code":false}