{"ID":2847513,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27280","arxiv_id":"2510.27280","title":"FOCUS: Efficient Keyframe Selection for Long Video Understanding","abstract":"Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.","short_abstract":"Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring usi...","url_abs":"https://arxiv.org/abs/2510.27280","url_pdf":"https://arxiv.org/pdf/2510.27280v2","authors":"[\"Zirui Zhu\",\"Hailun Xu\",\"Yang Luo\",\"Yong Liu\",\"Kanchan Sarkar\",\"Zhenheng Yang\",\"Yang You\"]","published":"2025-10-31T08:41:13Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":607537,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2847513,"paper_url":"https://arxiv.org/abs/2510.27280","paper_title":"FOCUS: Efficient Keyframe Selection for Long Video Understanding","repo_url":"https://github.com/NUS-HPC-AI-Lab/FOCUS","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}