{"ID":2833900,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02541","arxiv_id":"2512.02541","title":"AVGGT: Rethinking Global Attention for Accelerating VGGT","abstract":"Models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves substantial inference acceleration across different context lengths, yielding about $2\\times$ speedup at 100 frames, $4$--$5\\times$ at 300 frames, and $8$--$10\\times$ at 800 frames, while matching or slightly improving the accuracy of the original models and remaining robust in extremely dense multi-view settings where prior sparse-attention baselines fail.","short_abstract":"Models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this...","url_abs":"https://arxiv.org/abs/2512.02541","url_pdf":"https://arxiv.org/pdf/2512.02541v2","authors":"[\"Xianbing Sun\",\"Zhikai Zhu\",\"Zhengyu Lou\",\"Bo Yang\",\"Jinyang Tang\",\"Liqing Zhang\",\"He Wang\",\"Jianfu Zhang\"]","published":"2025-12-02T09:08:18Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
