{"ID":2827722,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.16978","arxiv_id":"2512.16978","title":"A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos","abstract":"Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. While some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.","short_abstract":"Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. While some incorporate open-ended questions and advanced metrics, they mostly rely on singl...","url_abs":"https://arxiv.org/abs/2512.16978","url_pdf":"https://arxiv.org/pdf/2512.16978v1","authors":"[\"Mohammed Irfan Kurpath\",\"Jaseel Muhammad Kaithakkodan\",\"Jinxing Zhou\",\"Sahal Shaji Mullappilly\",\"Mohammad Almansoori\",\"Noor Ahsan\",\"Beknur Kalmakhanbet\",\"Sambal Shikhar\",\"Rishabh Lalla\",\"Jean Lahoud\",\"Mariette Awad\",\"Fahad Shahbaz Khan\",\"Salman Khan\",\"Rao Muhammad Anwer\",\"Hisham Cholakkal\"]","published":"2025-12-18T18:59:27Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":605826,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2827722,"paper_url":"https://arxiv.org/abs/2512.16978","paper_title":"A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos","repo_url":"https://github.com/mbzuai-oryx/longshot","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
