{"ID":2825394,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20978","arxiv_id":"2512.20978","title":"GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model","abstract":"Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further employ DPO to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.","short_abstract":"Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. \nWe present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic to...","url_abs":"https://arxiv.org/abs/2512.20978","url_pdf":"https://arxiv.org/pdf/2512.20978v1","authors":"[\"Haoyang Li\",\"Xuyi Zhuang\",\"Azmat Adnan\",\"Ye Ni\",\"Wei Rao\",\"Shreyas Gopal\",\"Eng Siong Chng\"]","published":"2025-12-24T06:13:02Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
