{"ID":2832461,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.05513","arxiv_id":"2512.05513","title":"Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning","abstract":"Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K high-quality human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounded reasoning through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (e.g., Qwen, VideoR1, Gemini, and GPT-4o) reveal that existing models struggle to \"show what they know\" and vice versa. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We have released the dataset at https://github.com/LUNAProject22/Know-Show, and the code will be released in the same repository.","short_abstract":"Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their se...","url_abs":"https://arxiv.org/abs/2512.05513","url_pdf":"https://arxiv.org/pdf/2512.05513v3","authors":"[\"Chinthani Sugandhika\",\"Chen Li\",\"Deepu Rajan\",\"Basura Fernando\"]","published":"2025-12-05T08:15:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":606231,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2832461,"paper_url":"https://arxiv.org/abs/2512.05513","paper_title":"Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning","repo_url":"https://github.com/LUNAProject22/Know-Show","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
