{"ID":2875975,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.01552","arxiv_id":"2509.01552","title":"Variation-aware Vision Token Dropping for Faster Large Vision-Language Models","abstract":"Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency without architectural changes. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which critically hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a dynamic token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (\\textit{i.e.}, \\textbf{V$^2$Drop}), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks consistently demonstrate that V$^2$Drop maintains \\textbf{94.0\\%} and \\textbf{98.6\\%} of the original performance for image and video understanding tasks respectively, while reducing LLM generation latency by \\textbf{31.5\\%} and \\textbf{74.2\\%}.","short_abstract":"Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, consequently leading to reduced inference efficiency. Token compression offers a di...","url_abs":"https://arxiv.org/abs/2509.01552","url_pdf":"https://arxiv.org/pdf/2509.01552v2","authors":"[\"Junjie Chen\",\"Xuyang Liu\",\"Zichen Wen\",\"Yiyu Wang\",\"Siteng Huang\",\"Honggang Chen\"]","published":"2025-09-01T15:28:44Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
