{"ID":2839954,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.14307","arxiv_id":"2511.14307","title":"Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions","abstract":"In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events, following the AudioSet ontology. These segment-level predictions are subsequently calibrated before producing event-level predictions. Finally, these predictions are incorporated into a structured prompt, along with the question and candidate answers. This prompt is then fed to a fine-tuned version of Qwen2.5-7B-Instruct, trained using the GRPO algorithm with a simple reward function. Our method achieves an accuracy of 62.6% on the development set, demonstrating the effectiveness of combining acoustic event reasoning with instruction-tuned large language models for AQA.","short_abstract":"In this report, we describe our submission to Track 5 of the DCASE 2025 Challenge for the task of Audio Question Answering (AQA). Our system leverages the SSL backbone BEATs to extract frame-level audio features, which are then processed by a classification head to generate segment-level predictions of acoustic events,...","url_abs":"https://arxiv.org/abs/2511.14307","url_pdf":"https://arxiv.org/pdf/2511.14307v1","authors":"[\"Marcel Gibier\",\"Nolwenn Celton\",\"Raphaël Duroselle\",\"Pierre Serrano\",\"Olivier Boeffard\",\"Jean-François Bonastre\"]","published":"2025-11-18T10:05:36Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
