{"ID":2869550,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15389","arxiv_id":"2509.15389","title":"Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech Data","abstract":"Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. To bridge this gap, we systematically examine how different fine-tuning schemes including text-only, direct mixing, and curriculum learning affect spoken language understanding (SLU), focusing on scenarios where text-label pairs are abundant while paired speech-label data are limited. Results show that LALMs already achieve competitive performance with text-only fine-tuning, highlighting their strong generalization ability. Adding even small amounts of speech data (2-5%) yields substantial further gains, with curriculum learning particularly effective under scarce data. In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation. Overall, this study provides practical insights into the LALM fine-tuning under realistic data constraints.","short_abstract":"Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning, especially with limited speech data. \nTo bridge this gap, we systematically examine how different fine-tuning schemes including text-only, direct mixing, and curriculum learning affect sp...","url_abs":"https://arxiv.org/abs/2509.15389","url_pdf":"https://arxiv.org/pdf/2509.15389v2","authors":"[\"Youngwon Choi\",\"Jaeyoon Jung\",\"Hyeonyu Kim\",\"Huu-Kim Nguyen\",\"Hwayeon Kim\"]","published":"2025-09-18T19:54:08Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\",\"cs.LG\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}