{"ID":2844601,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.05936","arxiv_id":"2511.05936","title":"10 Open Challenges Steering the Future of Vision-Language-Action Models","abstract":"Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post training, and data synthesis -- all aiming to reach these milestones. Through these discussions, we hope to bring attention to the research avenues that may accelerate the development of VLA models into wider acceptability.","short_abstract":"Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- mult...","url_abs":"https://arxiv.org/abs/2511.05936","url_pdf":"https://arxiv.org/pdf/2511.05936v1","authors":"[\"Soujanya Poria\",\"Navonil Majumder\",\"Chia-Yu Hung\",\"Amir Ali Bagherzadeh\",\"Chuan Li\",\"Kenneth Kwok\",\"Ziwei Wang\",\"Cheston Tan\",\"Jiajun Wu\",\"David Hsu\"]","published":"2025-11-08T09:02:13Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false}
