{"ID":2831306,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08829","arxiv_id":"2512.08829","title":"InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models","abstract":"Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce \\textbf{InfiniteVL}. We first develop a hybrid base model called \\textbf{InfiniteVL-Base} that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a \\textbf{1.7$\\times$} decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra long context. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: \\textbf{InfiniteVL-Offline} for offline retrieval and \\textbf{InfiniteVL-Online} for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a \\textbf{5x} prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of \\textbf{25} FPS. \nCode and models are available at https://github.com/hustvl/InfiniteVL.","short_abstract":"Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce \\textbf{InfiniteVL}....","url_abs":"https://arxiv.org/abs/2512.08829","url_pdf":"https://arxiv.org/pdf/2512.08829v2","authors":"[\"Hongyuan Tao\",\"Bencheng Liao\",\"Shaoyu Chen\",\"Haoran Yin\",\"Qian Zhang\",\"Wenyu Liu\",\"Xinggang Wang\"]","published":"2025-12-09T17:18:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false,"code_links":[{"ID":606113,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831306,"paper_url":"https://arxiv.org/abs/2512.08829","paper_title":"InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models","repo_url":"https://github.com/hustvl/InfiniteVL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
