{"ID":2859509,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06195","arxiv_id":"2510.06195","title":"Latent Speech-Text Transformer","abstract":"Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalance disproportionately allocates pre-training and inference compute to speech, potentially hindering effective cross-modal alignment and slowing performance scaling by orders of magnitude. We introduce the Latent Speech-Text Transformer (LST), which aggregates speech tokens into latent speech patches that serve as higher-level autoregressive units. This design aligns the sequence-modeling granularity between speech and text while improving computational efficiency. The resulting patches can align with textual units to facilitate cross-modal knowledge transfer and compactly capture recurring acoustic patterns such as silence. Across story-completion benchmarks under both compute-controlled and data-controlled settings, LST consistently improves speech accuracy while also improving text performance, achieving up to +6.5% absolute gain on speech HellaSwag in compute-controlled training (+5.3% in data-controlled training). Under compute-controlled scaling from 420M to 1.8B parameters in a near compute-optimal regime, gains grow with scale, and improvements persist up to 7B parameters under fixed-token budgets. These benefits extend to downstream tasks: LST stabilizes ASR adaptation and reduces the effective autoregressive sequence length during ASR and TTS inference, lowering computational cost without degrading reconstruction quality. The code is available at https://github.com/facebookresearch/lst.","short_abstract":"Auto-regressive speech-text models pre-trained on interleaved text tokens and discretized speech tokens demonstrate strong speech understanding and generation, yet remain substantially less compute-efficient than text LLMs, partly due to the much longer sequences of speech tokens relative to text. This modality imbalan...","url_abs":"https://arxiv.org/abs/2510.06195","url_pdf":"https://arxiv.org/pdf/2510.06195v2","authors":"[\"Yen-Ju Lu\",\"Yashesh Gaur\",\"Wei Zhou\",\"Benjamin Muller\",\"Jesus Villalba\",\"Najim Dehak\",\"Luke Zettlemoyer\",\"Gargi Ghosh\",\"Mike Lewis\",\"Srinivasan Iyer\",\"Duc Le\"]","published":"2025-10-07T17:52:08Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Transformer\",\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608647,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2859509,"paper_url":"https://arxiv.org/abs/2510.06195","paper_title":"Latent Speech-Text Transformer","repo_url":"https://github.com/facebookresearch/lst","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
