{"ID":2885750,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.04369","arxiv_id":"2508.04369","title":"TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding","abstract":"Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO","short_abstract":"Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs' context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. H...","url_abs":"https://arxiv.org/abs/2508.04369","url_pdf":"https://arxiv.org/pdf/2508.04369v4","authors":"[\"Canhui Tang\",\"Zifan Han\",\"Hongbo Sun\",\"Sanping Zhou\",\"Xuchong Zhang\",\"Xin Wei\",\"Ye Yuan\",\"Huayu Zhang\",\"Jinglin Xu\",\"Hao Sun\"]","published":"2025-08-06T12:03:36Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611232,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2885750,"paper_url":"https://arxiv.org/abs/2508.04369","paper_title":"TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding","repo_url":"https://github.com/Hui-design/TSPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
