{"ID":2865966,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21060","arxiv_id":"2509.21060","title":"Measuring Audio's Impact on Correctness: Audio-Contribution-Aware Post-Training of Large Audio Language Models","abstract":"Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL) has demonstrated promising results, multi-stage approaches such as supervised fine-tuning (SFT) followed by RL remain suboptimal. The allocation of data across multiple training stages to maximize LALM capabilities has not been fully explored, and large-scale, high-quality datasets for such research are also lacking. To address these problems, we firstly present AudioMCQ, a comprehensive audio multiple-choice question dataset comprising 571k samples with two kinds of chain-of-thought annotations. Secondly, we investigate the prevalent zero audio-contribution phenomenon in LALMs, where models derive correct answers solely from textual information without processing audio content. We propose Audio-Contribution Filtering to partition data into weak and strong audio-contribution subsets. Based on these insights, we develop two effective post-training paradigms: Weak-to-Strong (SFT on weak audio-contribution data followed by RL on strong audio-contribution data) and Mixed-to-Strong (SFT on mixed audio-contribution data followed by RL on strong audio-contribution data). We achieve first place in the DCASE 2025 Audio-Question-Answering challenge by using AudioMCQ. Additionally, leveraging our dataset with different training strategies, we achieve 78.2% on MMAU-test-mini, 75.6% on MMAU, 67.1% on MMAR, and 70.7% on MMSU, establishing new state-of-the-art performance.","short_abstract":"Large Audio Language Models (LALMs) represent an important frontier in multimodal AI, addressing diverse audio tasks. Recently, post-training of LALMs has received increasing attention due to significant performance improvements over foundation models. While single-stage post-training such as reinforcement learning (RL...","url_abs":"https://arxiv.org/abs/2509.21060","url_pdf":"https://arxiv.org/pdf/2509.21060v4","authors":"[\"Haolin He\",\"Xingjian Du\",\"Renhe Sun\",\"Zheqi Dai\",\"Yujia Xiao\",\"Mingru Yang\",\"Jiayi Zhou\",\"Xiquan Li\",\"Zhengxi Liu\",\"Zining Liang\",\"Chunyat Wu\",\"Qianhua He\",\"Tan Lee\",\"Xie Chen\",\"Wei-Long Zheng\",\"Weiqiang Wang\",\"Mark Plumbley\",\"Jian Liu\",\"Qiuqiang Kong\"]","published":"2025-09-25T12:11:16Z","proceeding":"eess.AS","tasks":"[\"eess.AS\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
