{"ID":2855260,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13632","arxiv_id":"2510.13632","title":"Closing the Gap Between Text and Speech Understanding in LLMs","abstract":"Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.","short_abstract":"Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performa...","url_abs":"https://arxiv.org/abs/2510.13632","url_pdf":"https://arxiv.org/pdf/2510.13632v2","authors":"[\"Santiago Cuervo\",\"Skyler Seto\",\"Maureen de Seyssel\",\"Richard He Bai\",\"Zijin Gu\",\"Tatiana Likhomanenko\",\"Navdeep Jaitly\",\"Zakaria Aldeneh\"]","published":"2025-10-15T14:57:16Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}