{"ID":2866114,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21302","arxiv_id":"2509.21302","title":"Quantized Visual Geometry Grounded Transformer","abstract":"Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves the state-of-the-art results across different benchmarks and bit-width, surpassing the previous state-of-the-art generic quantization method with a great margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7$\\times$ memory reduction and 2.5$\\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in https://github.com/wlfeng0509/QuantVGGT.","short_abstract":"Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common prac...","url_abs":"https://arxiv.org/abs/2509.21302","url_pdf":"https://arxiv.org/pdf/2509.21302v4","authors":"[\"Weilun Feng\",\"Haotong Qin\",\"Mingqiang Wu\",\"Chuanguang Yang\",\"Yuqi Li\",\"Xiangqi Li\",\"Zhulin An\",\"Libo Huang\",\"Yulun Zhang\",\"Michele Magno\",\"Yongjun Xu\"]","published":"2025-09-25T15:17:11Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":609341,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2866114,"paper_url":"https://arxiv.org/abs/2509.21302","paper_title":"Quantized Visual Geometry Grounded Transformer","repo_url":"https://github.com/wlfeng0509/QuantVGGT","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
