{"ID":3004947,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03116","arxiv_id":"2606.03116","title":"AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following","abstract":"The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.","short_abstract":"The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fa...","url_abs":"https://arxiv.org/abs/2606.03116","url_pdf":"https://arxiv.org/pdf/2606.03116v1","authors":"[\"Haitao Li\",\"Tian Tan\",\"Yuguang Yang\",\"Shan Yang\",\"Xie Chen\"]","published":"2026-06-02T04:00:32Z","proceeding":"eess.AS","tasks":"[\"eess.AS\",\"cs.AI\",\"cs.SD\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}