{"ID":2825799,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.20276","arxiv_id":"2512.20276","title":"ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge","abstract":"Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.","short_abstract":"Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. 
However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction req...","url_abs":"https://arxiv.org/abs/2512.20276","url_pdf":"https://arxiv.org/pdf/2512.20276v1","authors":"[\"Yuntao Dai\",\"Hang Gu\",\"Teng Wang\",\"Qianyu Cheng\",\"Yifei Zheng\",\"Zhiyong Qiu\",\"Lei Gong\",\"Wenqi Lou\",\"Xuehai Zhou\"]","published":"2025-12-23T11:29:03Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.RO\"]","methods":"[\"Language Model\"]","project_urls":"[\"https://anonymous.4open.science/r/ActionFlow-1D47\"]","has_code":false}
