{"ID":2830857,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09927","arxiv_id":"2512.09927","title":"Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models","abstract":"Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining semantic coherence. By coupling expansion and merging within a single feed-forward pass, TEAM-VLA achieves a balanced trade-off between efficiency and effectiveness, without any retraining or parameter updates. Extensive experiments on the LIBERO benchmark demonstrate that TEAM-VLA consistently improves inference speed while maintaining or even surpassing the task success rate of full VLA models. The code is publicly available at https://github.com/Jasper-aaa/TEAM-VLA","short_abstract":"Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. 
However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and...","url_abs":"https://arxiv.org/abs/2512.09927","url_pdf":"https://arxiv.org/pdf/2512.09927v1","authors":"[\"Yifan Ye\",\"Jiaqi Ma\",\"Jun Cen\",\"Zhihe Lu\"]","published":"2025-12-10T18:59:24Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false,"code_links":[{"ID":606080,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2830857,"paper_url":"https://arxiv.org/abs/2512.09927","paper_title":"Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models","repo_url":"https://github.com/Jasper-aaa/TEAM-VLA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
