{"ID":2849265,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24693","arxiv_id":"2510.24693","title":"STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence","abstract":"Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5\\% temporal, -35.2\\% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.","short_abstract":"Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynami...","url_abs":"https://arxiv.org/abs/2510.24693","url_pdf":"https://arxiv.org/pdf/2510.24693v2","authors":"[\"Zihan Liu\",\"Zhikang Niu\",\"Qiuyang Xiao\",\"Zhisheng Zheng\",\"Ruoqi Yuan\",\"Yuhang Zang\",\"Yuhang Cao\",\"Xiaoyi Dong\",\"Jianze Liang\",\"Xie Chen\",\"Leilei Sun\",\"Dahua Lin\",\"Jiaqi Wang\"]","published":"2025-10-28T17:50:34Z","proceeding":"cs.SD","tasks":"[\"cs.SD\",\"cs.CL\",\"eess.AS\"]","methods":"[\"Language Model\"]","has_code":false}
