{"ID":2869305,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.14889","arxiv_id":"2509.14889","title":"CollabVLA: Self-Reflective Vision-Language-Action Model Dreaming Together with Human","abstract":"In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative models, by integrating VLM-based reflective reasoning with diffusion-based action generation under a mixture-of-experts design. Through a two-stage training recipe of action grounding and reflection tuning, it supports explicit self-reflection and proactively solicits human guidance when confronted with uncertainty or repeated failure. It cuts normalized Time by ~2x and Dream counts by ~4x vs. generative agents, achieving higher success rates, improved interpretability, and balanced low latency compared with existing methods. This work takes a pioneering step toward shifting VLAs from opaque controllers to genuinely assistive agents capable of reasoning, acting, and collaborating with humans.","short_abstract":"In this work, we present CollabVLA, a self-reflective vision-language-action framework that transforms a standard visuomotor policy into a collaborative assistant. CollabVLA tackles key limitations of prior VLAs, including domain overfitting, non-interpretable reasoning, and the high latency of auxiliary generative mod...","url_abs":"https://arxiv.org/abs/2509.14889","url_pdf":"https://arxiv.org/pdf/2509.14889v1","authors":"[\"Nan Sun\",\"Yongchang Li\",\"Chenxu Wang\",\"Huiying Li\",\"Huaping Liu\"]","published":"2025-09-18T12:10:41Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Diffusion Model\"]","has_code":false}
