{"ID":2841722,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.11298","arxiv_id":"2511.11298","title":"Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation","abstract":"Foundation models applied in robotics, particularly \\textbf{Vision--Language--Action (VLA)} models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our \\textbf{empirical experiences} from benchmarking four representative VLAs -- \\textbf{ACT}, \\textbf{OpenVLA--OFT}, \\textbf{RDT-1B}, and \\boldmath{$π_0$} -- across four manipulation tasks conducted in both simulation and on the \\textbf{ALOHA Mobile} platform. We establish a \\textbf{standardized evaluation framework} that measures performance along three key dimensions: (1) \\textit{accuracy and efficiency} (success rate and time-to-success), (2) \\textit{adaptability} across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) \\textit{language instruction-following accuracy}. Through this process, we observe that \\boldmath{$π_0$} demonstrates superior adaptability in out-of-distribution scenarios, while \\textbf{ACT} provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.","short_abstract":"Foundation models applied in robotics, particularly \\textbf{Vision--Language--Action (VLA)} models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our \\textbf{empirical experiences} from benchmarking fou...","url_abs":"https://arxiv.org/abs/2511.11298","url_pdf":"https://arxiv.org/pdf/2511.11298v1","authors":"[\"Yihao Zhang\",\"Yuankai Qi\",\"Xi Zheng\"]","published":"2025-11-14T13:35:30Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[]","has_code":false}
