{"ID":2828637,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14657","arxiv_id":"2512.14657","title":"Adapting Speech Language Model to Singing Voice Synthesis","abstract":"Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In this work, we adapt a 1.7B-parameter, TTS-pretrained SLM for singing voice synthesis (SVS), using only a 135-hour synthetic singing corpus, ACE-Opencpop. Building upon ESPNet-SpeechLM, our recipe involves the following steps: (1) tokenization of music score conditions and singing waveforms, (2) multi-stream language model token prediction, (3) conditional flow matching-based mel-spectrogram generation, and (4) a mel-to-wave vocoder. Experimental results demonstrate that our adapted SLM generalizes well to SVS and achieves performance comparable to leading discrete token-based SVS models.","short_abstract":"Speech Language Models (SLMs) have recently emerged as a unified paradigm for addressing a wide range of speech-related tasks, including text-to-speech (TTS), speech enhancement (SE), and automatic speech recognition (ASR). However, the generalization capability of large-scale pre-trained SLMs remains underexplored. In...","url_abs":"https://arxiv.org/abs/2512.14657","url_pdf":"https://arxiv.org/pdf/2512.14657v1","authors":"[\"Yiwen Zhao\",\"Jiatong Shi\",\"Jinchuan Tian\",\"Yuxun Tang\",\"Jiarui Hai\",\"Jionghao Han\",\"Shinji Watanabe\"]","published":"2025-12-16T18:17:34Z","proceeding":"cs.SD","tasks":"[\"cs.SD\"]","methods":"[\"Language Model\"]","has_code":false}