{"ID":2873317,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.06415","arxiv_id":"2509.06415","title":"Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models","abstract":"Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.","short_abstract":"Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document ima...","url_abs":"https://arxiv.org/abs/2509.06415","url_pdf":"https://arxiv.org/pdf/2509.06415v2","authors":"[\"Jaemin Son\",\"Sujin Choi\",\"Inyong Yun\"]","published":"2025-09-08T08:12:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
