{"ID":2889886,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20177","arxiv_id":"2507.20177","title":"Towards Universal Modal Tracking with Online Dense Temporal Token Learning","abstract":"We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals: \\textbf{Video-level Sampling}. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. \\textbf{Video-level Association}. Furthermore, we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video-stream manner. \\textbf{Modality Scalable}. We propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters through one-shot training for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for inference in subsequent video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our {\\modaltracker} achieves new \\textit{SOTA} performance. 
The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.","short_abstract":"We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called {\\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model i...","url_abs":"https://arxiv.org/abs/2507.20177","url_pdf":"https://arxiv.org/pdf/2507.20177v1","authors":"[\"Yaozong Zheng\",\"Bineng Zhong\",\"Qihua Liang\",\"Shengping Zhang\",\"Guorong Li\",\"Xianxian Li\",\"Rongrong Ji\"]","published":"2025-07-27T08:47:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.MM\"]","methods":"[]","has_code":false,"code_links":[{"ID":611704,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2889886,"paper_url":"https://arxiv.org/abs/2507.20177","paper_title":"Towards Universal Modal Tracking with Online Dense Temporal Token Learning","repo_url":"https://github.com/GXNU-ZhongLab/ODTrack","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
