{"ID":2822660,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.02281","arxiv_id":"2601.02281","title":"InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams","abstract":"The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively \"rolling\" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding.\nCode is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT","short_abstract":"The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming archi...","url_abs":"https://arxiv.org/abs/2601.02281","url_pdf":"https://arxiv.org/pdf/2601.02281v1","authors":"[\"Shuai Yuan\",\"Yantai Yang\",\"Xiaotian Yang\",\"Xupeng Zhang\",\"Zhonghao Zhao\",\"Lingming Zhang\",\"Zhipeng Zhang\"]","published":"2026-01-05T17:11:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":605433,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2822660,"paper_url":"https://arxiv.org/abs/2601.02281","paper_title":"InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams","repo_url":"https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
