{"ID":2822725,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.02536","arxiv_id":"2601.02536","title":"MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark","abstract":"Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple-choice questions. We introduce a novel open-ended multimodal VideoQA benchmark, MovieRecapsQA, created using movie recap videos -- a distinctive type of YouTube content that summarizes a film via a voiceover description of key clips from the movie (recap video). From the transcribed voiceover (recap summary) of 60 recap videos, we generate $\\approx$8.2K questions along with the necessary ``facts'' expected in each answer; the former facilitates the creation of questions that require multimodal reasoning and the latter allows the construction of a reference-free evaluation metric that can be applied to open-ended responses. To our knowledge, this is the first reference-free open-ended VideoQA benchmark. The benchmark allows each question to be evaluated in different input video settings: given (a) the full-length movie, (b) the full ($\\approx$11 min) recap video (visual only), (c) $\\approx$14 min of aligned movie scenes, i.e., movie scenes relevant to the question, and (d) $\\approx$1.2 min of aligned recap video scenes. In all cases, the text of any associated movie dialogue is provided. Each question is categorized by the modality required to answer it -- visual, dialogue, or both -- enabling fine-grained evaluation of multimodal capabilities. 
We benchmark (setting (d)) seven state-of-the-art MLLMs and find that (i) only our reference-free metric produces meaningful human-aligned model separation; (ii) vision-centric questions yield the lowest scores across all models; (iii) removing visual input often \\textit{improves} model factuality; and (iv) the primary bottleneck is visual perception, not visual reasoning.","short_abstract":"Understanding real-world videos such as movies requires integrating visual and dialogue cues. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and, given the difficulty of evaluating free-form answers, largely resort to simple multiple choice questions. We introduce a novel open-ended multi...","url_abs":"https://arxiv.org/abs/2601.02536","url_pdf":"https://arxiv.org/pdf/2601.02536v2","authors":"[\"Shaden Shaar\",\"Bradon Thymes\",\"Sirawut Chaixanien\",\"Claire Cardie\",\"Bharath Hariharan\"]","published":"2026-01-05T20:17:25Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}