{"ID":2844888,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.05143","arxiv_id":"2511.05143","title":"Synthesizing speech with selected perceptual voice qualities - A case study with creaky voice","abstract":"The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system, that is augmented with a global speaker attribute manipulation block based on normalizing flows, is capable of correctly manipulating the non-persistent, localized quality of creaky voice, thus avoiding the necessity of a, typically unreliable, frame-wise creak predictor. Subjective listening tests confirm successful creak manipulation at a slightly reduced MOS score compared to the original recording.","short_abstract":"The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp. Here, we show that a TTS system, that is augmented with a global speaker attri...","url_abs":"https://arxiv.org/abs/2511.05143","url_pdf":"https://arxiv.org/pdf/2511.05143v1","authors":"[\"Frederik Rautenberg\",\"Fritz Seebauer\",\"Jana Wiechmann\",\"Michael Kuhlmann\",\"Petra Wagner\",\"Reinhold Haeb-Umbach\"]","published":"2025-11-07T10:44:07Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[]","has_code":false}