{"ID":2838908,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.16046","arxiv_id":"2511.16046","title":"Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio","abstract":"Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question \"who spoke what\" in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). The model is trained only on short audio under 20s but is capable of streamable inference on long-form audio without additional training. This is achieved by introducing a Speaker Prompt Cache (SPC) with an on-the-fly update mechanism during chunk-wise streaming inference, inspired by the autoregressive nature of LLMs. The SPC also allows the seamless use of pre-enrolled speaker profiles which is common in many scenarios like meeting transcription. To further enhance diarization capability, we incorporate word-level speaker supervision into the speech encoder during training. Experimental results demonstrate that our system outperforms strong baselines, including Sortformer and Meta-Cat in the local setting on audio up to 20s, and DiarizationLM on long-form audio, despite being fully end-to-end and streamable while DiarizationLM follows a cascaded offline pipeline. To the best of our knowledge, this is the first work enabling zero-shot streamable joint ASR and diarization on long audio using a Speech-LLM trained only on short audio, achieving state-of-the-art performance.","short_abstract":"Joint automatic speech recognition (ASR) and speaker diarization aim to answer the question \"who spoke what\" in multi-speaker scenarios. In this paper, we present an end-to-end speech large language model (Speech-LLM) for Joint strEamable DIarization and aSr (JEDIS-LLM). \nThe model is trained only on short audio under 2...","url_abs":"https://arxiv.org/abs/2511.16046","url_pdf":"https://arxiv.org/pdf/2511.16046v1","authors":"[\"Mohan Shi\",\"Xiong Xiao\",\"Ruchao Fan\",\"Shaoshi Ling\",\"Jinyu Li\"]","published":"2025-11-20T05:07:13Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}