{"ID":2890756,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18100","arxiv_id":"2507.18100","title":"Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning","abstract":"Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.","short_abstract":"Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. \nIn this work, we introd...","url_abs":"https://arxiv.org/abs/2507.18100","url_pdf":"https://arxiv.org/pdf/2507.18100v1","authors":"[\"Ruizhe Chen\",\"Zhiting Fan\",\"Tianze Luo\",\"Heqing Zou\",\"Zhaopeng Feng\",\"Guiyang Xie\",\"Hansheng Zhang\",\"Zhuochen Wang\",\"Zuozhu Liu\",\"Huaijian Zhang\"]","published":"2025-07-24T05:24:01Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
