{"ID":2882231,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.10456","arxiv_id":"2508.10456","title":"Exploring Cross-Utterance Speech Contexts for Conformer-Transducer Speech Recognition Systems","abstract":"This paper investigates four types of cross-utterance speech context modeling approaches for streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; and iv) a novel chunk-based approach applied to C-T models for the first time. An efficient batch-training scheme is proposed for contextual C-Ts that uses spliced speech utterances within each minibatch to minimize the synchronization overhead while preserving the sequential order of cross-utterance speech contexts. Experiments are conducted on four benchmark speech datasets across three languages: the English GigaSpeech and Mandarin WenetSpeech corpora used in contextual C-T model pre-training; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets used in domain fine-tuning. The best performing contextual C-T systems consistently outperform their respective baselines using no cross-utterance speech contexts in pre-training and fine-tuning stages with statistically significant average word error rate (WER) or character error rate (CER) reductions of up to 0.9%, 1.1%, 0.51%, and 0.98% absolute (6.0%, 5.4%, 2.0%, and 3.4% relative) on the four tasks respectively. 
Their competitive performance against Wav2vec2.0-Conformer, XLSR-128, and Whisper models highlights the potential benefit of incorporating cross-utterance speech contexts into current speech foundation models.","short_abstract":"This paper investigates four types of cross-utterance speech context modeling approaches for streaming and non-streaming Conformer-Transducer (C-T) ASR systems: i) input audio feature concatenation; ii) cross-utterance Encoder embedding concatenation; iii) cross-utterance Encoder embedding pooling projection; and iv)...","url_abs":"https://arxiv.org/abs/2508.10456","url_pdf":"https://arxiv.org/pdf/2508.10456v1","authors":"[\"Mingyu Cui\",\"Mengzhe Geng\",\"Jiajun Deng\",\"Chengxi Deng\",\"Jiawen Kang\",\"Shujie Hu\",\"Guinan Li\",\"Tianzi Wang\",\"Zhaoqing Li\",\"Xie Chen\",\"Xunying Liu\"]","published":"2025-08-14T08:54:01Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Transformer\"]","has_code":false}
