{"ID":2860444,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03584","arxiv_id":"2510.03584","title":"FrameOracle: Learning What to See and How Much to See in Videos","abstract":"Vision-language models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task complexity. To address this, we present FrameOracle, a lightweight, plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained via a curriculum that progresses from weak proxy signals, such as cross-modal similarity, to stronger supervision with FrameOracle-41K, the first large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames per question. Extensive experiments across five VLMs and six benchmarks show that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without accuracy loss. When starting from 64-frame candidates, it reduces inputs to 13.9 frames on average while improving accuracy by 1.5%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.","short_abstract":"Vision-language models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task...","url_abs":"https://arxiv.org/abs/2510.03584","url_pdf":"https://arxiv.org/pdf/2510.03584v2","authors":"[\"Chaoyu Li\",\"Tianzhi Li\",\"Fei Tao\",\"Zhenyu Zhao\",\"Ziqian Wu\",\"Maozheng Zhao\",\"Juntong Song\",\"Cheng Niu\",\"Pooyan Fazli\"]","published":"2025-10-04T00:24:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
