{"ID":2868615,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15612","arxiv_id":"2509.15612","title":"Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition","abstract":"Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advancement of Large Audio-Language Models (LALMs) has already brought some new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALMs architecture. While Chain of Thoughts (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, which requires the model to deeply comprehend speech signals, differentiate various speakers, and handle overlapping utterances is particularly well-suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR for performance improvement. A novel CoT dataset of TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL using selected data to enhance generalized reasoning capabilities. Experiment results show a significant improvement of TS-ASR performance with CoT and RL training, which demonstrates the effectiveness of the proposed CoT and RL training methods adapted for the TS-ASR task.","short_abstract":"Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advancement of Large Audio-Language Models (LALMs) has already brought some new insights to TS-ASR. However, significant room for optimization...","url_abs":"https://arxiv.org/abs/2509.15612","url_pdf":"https://arxiv.org/pdf/2509.15612v2","authors":"[\"Yiru Zhang\",\"Hang Su\",\"Lichun Fan\",\"Zhenbo Luo\",\"Jian Luan\"]","published":"2025-09-19T05:22:09Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"eess.AS\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
