{"ID":2887495,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01178","arxiv_id":"2508.01178","title":"Advancing the Foundation Model for Music Understanding","abstract":"The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.","short_abstract":"The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental...","url_abs":"https://arxiv.org/abs/2508.01178","url_pdf":"https://arxiv.org/pdf/2508.01178v1","authors":"[\"Yi Jiang\",\"Wei Wang\",\"Xianwen Guo\",\"Huiyun Liu\",\"Hanrui Wang\",\"Youri Xu\",\"Haoqi Gu\",\"Zhongqian Xie\",\"Chuanjiang Luo\"]","published":"2025-08-02T03:33:47Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.AI\",\"cs.IR\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}