{"ID":3052322,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-06T05:44:34.749899951Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04474","arxiv_id":"2606.04474","title":"Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention","abstract":"Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.","short_abstract":"Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on lo...","url_abs":"https://arxiv.org/abs/2606.04474","url_pdf":"https://arxiv.org/pdf/2606.04474v1","authors":"[\"Ming-Hao Hsu\",\"Xiaohai Tian\",\"Jun Zhang\",\"Zhizheng Wu\"]","published":"2026-06-03T05:44:09Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"eess.AS\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
