{"ID":2845722,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.03310","arxiv_id":"2511.03310","title":"TASU: Text-Only Alignment for Speech Understanding","abstract":"Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalization to unseen domains or tasks. To address these limitations, we propose TASU (Text-only Alignment for Speech Understanding), a novel alignment paradigm that can leverage only unpaired text data to guide cross-modal alignment. Experiments show that TASU achieves competitive zero-shot speech recognition. Leveraging this property, it can further function as a pre-training stage in curriculum learning, enhancing domain generalization in speech recognition. Ultimately, TASU can extend its zero-shot generalization to a wide range of speech understanding tasks and notably outperforms prominent Speech LLMs including GLM-4-Voice and Step-Audio on the MMSU benchmark, establishing TASU as an efficient and scalable alignment paradigm for Speech LLMs.","short_abstract":"Recent advances in Speech Large Language Models (Speech LLMs) have paved the way for unified architectures across diverse speech understanding tasks. However, prevailing alignment paradigms rely heavily on large-scale audio-text paired data and computationally intensive training, yet often exhibit limited generalizatio...","url_abs":"https://arxiv.org/abs/2511.03310","url_pdf":"https://arxiv.org/pdf/2511.03310v2","authors":"[\"Jing Peng\",\"Yi Yang\",\"Xu Li\",\"Yu Xi\",\"Quanwei Tang\",\"Yangui Fang\",\"Junjie Li\",\"Kai Yu\"]","published":"2025-11-05T09:24:48Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
