{"ID":2828674,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.14870","arxiv_id":"2512.14870","title":"HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering","abstract":"Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, only modestly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.","short_abstract":"Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each q...","url_abs":"https://arxiv.org/abs/2512.14870","url_pdf":"https://arxiv.org/pdf/2512.14870v2","authors":"[\"Dan Ben-Ami\",\"Gabriele Serussi\",\"Kobi Cohen\",\"Chaim Baskin\"]","published":"2025-12-16T19:34:47Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"eess.IV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
