{"ID":2830165,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.10403","arxiv_id":"2512.10403","title":"BRACE: A Benchmark for Robust Audio Caption Quality Evaluation","abstract":"Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.","short_abstract":"Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore...","url_abs":"https://arxiv.org/abs/2512.10403","url_pdf":"https://arxiv.org/pdf/2512.10403v1","authors":"[\"Tianyu Guo\",\"Hongyu Chen\",\"Hao Liang\",\"Meiyi Qiang\",\"Bohan Zeng\",\"Linzhuang Sun\",\"Bin Cui\",\"Wentao Zhang\"]","published":"2025-12-11T08:09:24Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}