{"ID":2827424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16250","arxiv_id":"2512.16250","title":"AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding","abstract":"Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization with intrinsic multimodal self-evaluation as reward and selective parameter adaptation for data- and parameter-efficient updates. Using RAFT, we achieve up to 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.","short_abstract":"Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video unde...","url_abs":"https://arxiv.org/abs/2512.16250","url_pdf":"https://arxiv.org/pdf/2512.16250v1","authors":"[\"Sanjoy Chowdhury\",\"Karren D. Yang\",\"Xudong Liu\",\"Fartash Faghri\",\"Pavan Kumar Anasosalu Vasu\",\"Oncel Tuzel\",\"Dinesh Manocha\",\"Chun-Liang Li\",\"Raviteja Vemulapalli\"]","published":"2025-12-18T07:01:47Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.MA\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
