{"ID":2890770,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18119","arxiv_id":"2507.18119","title":"GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness","abstract":"Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy that progressively aligns linguistic, paralinguistic, and speaker characteristic information using large-scale speech-text corpora. Experimental results on TELEVAL, a multi-dimensional evaluation benchmark, demonstrate that GOAT-SLM achieves well-balanced performance across both semantic and non-semantic tasks, and outperforms existing open-source models in handling emotion, dialectal variation, and age-sensitive interactions. This work highlights the importance of modeling beyond linguistic content and advances the development of more natural, adaptive, and socially aware spoken language systems.","short_abstract":"Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues emb...","url_abs":"https://arxiv.org/abs/2507.18119","url_pdf":"https://arxiv.org/pdf/2507.18119v2","authors":"[\"Hongjie Chen\",\"Zehan Li\",\"Yaodong Song\",\"Wenming Deng\",\"Yitong Yao\",\"Yuxin Zhang\",\"Hang Lv\",\"Xuechao Zhu\",\"Jian Kang\",\"Jie Lian\",\"Jie Li\",\"Chao Wang\",\"Shuangyong Song\",\"Yongxiang Li\",\"Zhongjiang He\",\"Xuelong Li\"]","published":"2025-07-24T06:10:29Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.SD\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}
