{"ID":2858308,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08480","arxiv_id":"2510.08480","title":"Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools","abstract":"Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.","short_abstract":"Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework tha...","url_abs":"https://arxiv.org/abs/2510.08480","url_pdf":"https://arxiv.org/pdf/2510.08480v2","authors":"[\"Zhenlong Yuan\",\"Xiangyan Qu\",\"Chengxuan Qian\",\"Rui Chen\",\"Jing Tang\",\"Lei Sun\",\"Xiangxiang Chu\",\"Dapeng Zhang\",\"Yiwei Wang\",\"Yujun Cai\",\"Shuo Li\"]","published":"2025-10-09T17:20:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}