{"ID":2868834,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15969","arxiv_id":"2509.15969","title":"VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency","abstract":"We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a limited look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.","short_abstract":"We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a limited look-ahead that does not delay onset. Built around an increme...","url_abs":"https://arxiv.org/abs/2509.15969","url_pdf":"https://arxiv.org/pdf/2509.15969v2","authors":"[\"Nikita Torgashov\",\"Gustav Eje Henter\",\"Gabriel Skantze\"]","published":"2025-09-19T13:26:46Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.CL\",\"cs.HC\",\"cs.LG\",\"cs.SD\"]","methods":"[\"Transformer\"]","has_code":false}
