{"ID":2869423,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15085","arxiv_id":"2509.15085","title":"Real-Time Streaming Mel Vocoding with Generative Flow Matching","abstract":"The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values than well-established, non-streaming-capable baselines for Mel vocoding, including HiFi-GAN.","short_abstract":"The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank,...","url_abs":"https://arxiv.org/abs/2509.15085","url_pdf":"https://arxiv.org/pdf/2509.15085v1","authors":"[\"Simon Welker\",\"Tal Peer\",\"Timo Gerkmann\"]","published":"2025-09-18T15:43:06Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.LG\",\"cs.SD\",\"eess.SP\"]","methods":"[\"Generative Adversarial Network\"]","has_code":false}