{"ID":2844276,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.06238","arxiv_id":"2511.06238","title":"Temporal-Guided Visual Foundation Models for Event-Based Vision","abstract":"Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained on image data remains under-explored for event-based vision. To address this, we propose Temporal-Guided VFM (TGVFM), a novel framework that integrates VFMs with our temporal context fusion block seamlessly to bridge this gap. Our temporal block introduces three key components: (1) Long-Range Temporal Attention to model global temporal dependencies, (2) Dual Spatiotemporal Attention for multi-scale frame correlation, and (3) Deep Feature Guidance Mechanism to fuse semantic-temporal features. By retraining event-to-video models on real-world data and leveraging transformer-based VFMs, TGVFM preserves spatiotemporal dynamics while harnessing pretrained representations. Experiments demonstrate SoTA performance across semantic segmentation, depth estimation, and object detection, with improvements of 16%, 21%, and 16% over existing methods, respectively. Overall, this work unlocks the cross-modality potential of image-based VFMs for event-based vision with temporal reasoning. Code is available at https://github.com/XiaRho/TGVFM.","short_abstract":"Event cameras offer unique advantages for vision tasks in challenging environments, yet processing asynchronous event streams remains an open challenge. While existing methods rely on specialized architectures or resource-intensive training, the potential of leveraging modern Visual Foundation Models (VFMs) pretrained...","url_abs":"https://arxiv.org/abs/2511.06238","url_pdf":"https://arxiv.org/pdf/2511.06238v1","authors":"[\"Ruihao Xia\",\"Junhong Cai\",\"Luziwei Leng\",\"Liuyi Wang\",\"Chengju Liu\",\"Ran Cheng\",\"Yang Tang\",\"Pan Zhou\"]","published":"2025-11-09T05:45:25Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":607276,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2844276,"paper_url":"https://arxiv.org/abs/2511.06238","paper_title":"Temporal-Guided Visual Foundation Models for Event-Based Vision","repo_url":"https://github.com/XiaRho/TGVFM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}