{"ID":2895894,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09070","arxiv_id":"2507.09070","title":"SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment","abstract":"Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. However, timbre leakage (where source speaker traits persist) remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity with content. We introduce SemAlignVC, an architecture designed to prevent timbre leakage using SemAlign, a novel method that aligns text and audio representations to ensure speaker-independent semantic encoding. This disentangled representation conditions an autoregressive transformer for high-fidelity conversion without explicit speaker embeddings. Experiments show SemAlignVC significantly reduces timbre leakage, outperforming baselines in speaker timbre similarity, intelligibility, and naturalness, making it a robust, privacy-preserving, and generalizable VC solution. Audio samples can be accessed at https://shivammehta25.github.io/SemAlignVC/","short_abstract":"Zero-shot voice conversion (VC) synthesizes speech in a target speaker's voice while preserving linguistic and paralinguistic content. 
However, timbre leakage (where source speaker traits persist) remains a challenge, especially in neural codec and LLM-based VC, where quantized representations entangle speaker identity w...","url_abs":"https://arxiv.org/abs/2507.09070","url_pdf":"https://arxiv.org/pdf/2507.09070v1","authors":"[\"Shivam Mehta\",\"Yingru Liu\",\"Zhenyu Tang\",\"Kainan Peng\",\"Vimal Manohar\",\"Shun Zhang\",\"Mike Seltzer\",\"Qing He\",\"Mingbo Ma\"]","published":"2025-07-11T23:14:07Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.SD\"]","methods":"[\"Transformer\",\"Large Language Model\"]","has_code":false}