{"ID":2923576,"CreatedAt":"2026-06-02T04:05:25.881865328Z","UpdatedAt":"2026-06-04T13:12:39.622923895Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02400","arxiv_id":"2606.02400","title":"SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription","abstract":"Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.","short_abstract":"Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. 
However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances...","url_abs":"https://arxiv.org/abs/2606.02400","url_pdf":"https://arxiv.org/pdf/2606.02400v1","authors":"[\"Yuhang Dai\",\"Haopeng Lin\",\"Zhennan Lin\",\"Jiale Qian\",\"Jun Wu\",\"Hanke Xie\",\"Hao Meng\",\"Hanlin Wen\",\"Chuang Ding\",\"Shunshun Yin\",\"Ming Tao\",\"Lei Xie\",\"Xinsheng Wang\"]","published":"2026-06-01T15:47:01Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
