{"ID":2860364,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04282","arxiv_id":"2510.04282","title":"Flexible and Efficient Spatio-Temporal Transformer for Sequential Visual Place Recognition","abstract":"Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. However, existing approaches prioritize performance at the expense of flexibility and efficiency. To address this gap, we propose Adapt-STformer, a Seq-VPR method built around our novel Recurrent Deformable Transformer Encoder (Recurrent-DTE), which uses an iterative recurrent mechanism to fuse information from multiple sequential frames. This design naturally supports variable seq-lengths, fast inference, and low memory usage. Experiments on the Nordland, Oxford, and NuScenes datasets show that Adapt-STformer boosts recall by up to 17% while reducing sequence extraction time by 36% and lowering memory usage by 35% relative to our best comparable baseline. Our code is released at https://ai4ce.github.io/Adapt-STFormer/.","short_abstract":"Sequential Visual Place Recognition (Seq-VPR) leverages transformers to capture spatio-temporal features effectively. In practice, a transformer-based Seq-VPR model should be flexible to the number of frames per sequence (seq-length), deliver fast inference, and have low memory usage to meet real-time constraints. How...","url_abs":"https://arxiv.org/abs/2510.04282","url_pdf":"https://arxiv.org/pdf/2510.04282v2","authors":"[\"Yu Kiu\",\"Lau\",\"Chao Chen\",\"Ge Jin\",\"Chen Feng\"]","published":"2025-10-05T16:52:12Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false}