{"ID":2872693,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.08753","arxiv_id":"2509.08753","title":"Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling","abstract":"We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence methods rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step, and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling","short_abstract":"We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. 
Alternatively, streaming sequ...","url_abs":"https://arxiv.org/abs/2509.08753","url_pdf":"https://arxiv.org/pdf/2509.08753v2","authors":"[\"Neil Zeghidour\",\"Eugene Kharitonov\",\"Manu Orsini\",\"Václav Volhejn\",\"Gabriel de Marmiesse\",\"Edouard Grave\",\"Patrick Pérez\",\"Laurent Mazaré\",\"Alexandre Défossez\"]","published":"2025-09-10T16:43:01Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":609992,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2872693,"paper_url":"https://arxiv.org/abs/2509.08753","paper_title":"Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling","repo_url":"https://github.com/kyutai-labs/delayed-streams-modeling","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
