{"ID":2857499,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.09245","arxiv_id":"2510.09245","title":"SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion","abstract":"Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.","short_abstract":"Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degr...","url_abs":"https://arxiv.org/abs/2510.09245","url_pdf":"https://arxiv.org/pdf/2510.09245v1","authors":"[\"Zhao Guo\",\"Ziqian Ning\",\"Guobin Ma\",\"Lei Xie\"]","published":"2025-10-10T10:33:20Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[]","has_code":false}
