{"ID":2858842,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.07134","arxiv_id":"2510.07134","title":"TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking","abstract":"Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.","short_abstract":"Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, ex...","url_abs":"https://arxiv.org/abs/2510.07134","url_pdf":"https://arxiv.org/pdf/2510.07134v1","authors":"[\"Jiahang Liu\",\"Yunpeng Qi\",\"Jiazhao Zhang\",\"Minghan Li\",\"Shaoan Wang\",\"Kui Wu\",\"Hanjing Ye\",\"Hong Zhang\",\"Zhibo Chen\",\"Fangwei Zhong\",\"Zhizheng Zhang\",\"He Wang\"]","published":"2025-10-08T15:29:17Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.CV\"]","methods":"[]","has_code":false}
