{"ID":2849769,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23569","arxiv_id":"2510.23569","title":"EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT","abstract":"Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.","short_abstract":"Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning b...","url_abs":"https://arxiv.org/abs/2510.23569","url_pdf":"https://arxiv.org/pdf/2510.23569v1","authors":"[\"Baoqi Pei\",\"Yifei Huang\",\"Jilan Xu\",\"Yuping He\",\"Guo Chen\",\"Fei Wu\",\"Yu Qiao\",\"Jiangmiao Pang\"]","published":"2025-10-27T17:38:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607741,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2849769,"paper_url":"https://arxiv.org/abs/2510.23569","paper_title":"EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT","repo_url":"https://github.com/InternRobotics/EgoThinker","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
