{"ID":2840313,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12908","arxiv_id":"2511.12908","title":"DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning","abstract":"Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-task, multi-sport video understanding. DeepSport shifts from passive frame processing to active, iterative reasoning, dynamically extracting frames to \"think with videos.\" To train our model, we curate a unified 78k-sample dataset via a rigorous three-step text-and-vision distillation pipeline. We then employ a progressive two-stage training strategy: a Sports Curriculum Supervised Fine-Tuning phase to build foundational perception, followed by Agentic Reinforcement Learning with a novel tool-use reward. Extensive experiments on a comprehensive 6.7k benchmark demonstrate that DeepSport achieves state-of-the-art performance, outperforming powerful proprietary and open-source models, while utilizing significantly fewer frames. Furthermore, it exhibits strong zero-shot transferability to unseen sports and broad motion recognition tasks, establishing a highly efficient and generalized foundation for complex video reasoning.","short_abstract":"Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-t...","url_abs":"https://arxiv.org/abs/2511.12908","url_pdf":"https://arxiv.org/pdf/2511.12908v2","authors":"[\"Junbo Zou\",\"Haotian Xia\",\"Zhen Ye\",\"Shengjie Zhang\",\"Christopher Lai\",\"Vicente Ordonez\",\"Weining Shen\",\"Hanjie Chen\"]","published":"2025-11-17T02:57:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}
