{"ID":2886290,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03337","arxiv_id":"2508.03337","title":"Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration","abstract":"The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that synergistically combines Adaptive Frame-Pruning(AFP) with a lightweight text-based semantic graph. AFP intelligently prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides crucial, low-cost semantic compensation. Conducting extensive experiments on the LongVideoBench and Video-MME benchmarks against multiple state-of-the-art selectors, our approach demonstrates a drastic reduction in total input tokens by up to 82.2%. Crucially, by creating a concise, high-quality prompt, our framework not only enhances efficiency but also demonstrates a remarkable ability to robustify and improve the accuracy of upstream selectors, achieving results that are highly competitive with, and often superior to, baselines that use vastly more frames.","short_abstract":"The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selector...","url_abs":"https://arxiv.org/abs/2508.03337","url_pdf":"https://arxiv.org/pdf/2508.03337v8","authors":"[\"Shaoguang Wang\",\"Weiyu Guo\",\"Ziyang Chen\",\"Yijie Xu\",\"Xuming Hu\",\"Hui Xiong\"]","published":"2025-08-05T11:31:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
