{"ID":2876147,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00675","arxiv_id":"2509.00675","title":"Speaker-Conditioned Phrase Break Prediction for Text-to-Speech with Phoneme-Level Pre-trained Language Model","abstract":"This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related characteristics solely from the phrasing task. Besides, we explore the potential of pre-trained speaker embeddings for unseen speakers through a few-shot adaptation method. Furthermore, we pioneer the application of phoneme-level pre-trained language models to this TTS front-end task, which significantly boosts the accuracy of the phrasing model. Our methods are rigorously assessed through both objective and subjective evaluations, demonstrating their effectiveness.","short_abstract":"This paper advances phrase break prediction (also known as phrasing) in multi-speaker text-to-speech (TTS) systems. We integrate speaker-specific features by leveraging speaker embeddings to enhance the performance of the phrasing model. We further demonstrate that these speaker embeddings can capture speaker-related c...","url_abs":"https://arxiv.org/abs/2509.00675","url_pdf":"https://arxiv.org/pdf/2509.00675v1","authors":"[\"Dong Yang\",\"Yuki Saito\",\"Takaaki Saeki\",\"Tomoki Koriyama\",\"Wataru Nakata\",\"Detai Xin\",\"Hiroshi Saruwatari\"]","published":"2025-08-31T03:06:37Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}
