{"ID":2849083,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24372","arxiv_id":"2510.24372","title":"Bayesian Speech Synthesizers Can Learn from Multiple Teachers","abstract":"Text-to-Speech (TTS) is inherently a \"one-to-many\" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a \"one-to-many\" training strategy that leverages synthetic samples as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitating teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (achieving a 25.8% relative WER reduction) and naturally supports high-quality streaming generation. Audio samples are available at https://belletts.github.io/Belle/.","short_abstract":"Text-to-Speech (TTS) is inherently a \"one-to-many\" mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, the...","url_abs":"https://arxiv.org/abs/2510.24372","url_pdf":"https://arxiv.org/pdf/2510.24372v3","authors":"[\"Ziyang Zhang\",\"Yifan Gao\",\"Xuenan Xu\",\"Baoxiang Li\",\"Wen Wu\",\"Chao Zhang\"]","published":"2025-10-28T12:49:46Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}
