{"ID":2875323,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.02055","arxiv_id":"2509.02055","title":"Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance","abstract":"Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce \\textbf{Align-Then-stEer (\\texttt{ATE})}, a novel, data-efficient, and plug-and-play adaptation framework. \\texttt{ATE} first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to \\textbf{9.8\\%} in simulation and achieves a striking \\textbf{32\\% success rate gain} in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.","short_abstract":"Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data....","url_abs":"https://arxiv.org/abs/2509.02055","url_pdf":"https://arxiv.org/pdf/2509.02055v2","authors":"[\"Yang Zhang\",\"Chenwei Wang\",\"Ouyang Lu\",\"Yuan Zhao\",\"Yunfei Ge\",\"Zhenglong Sun\",\"Xiu Li\",\"Chi Zhang\",\"Chenjia Bai\",\"Xuelong Li\"]","published":"2025-09-02T07:51:59Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[\"Diffusion Model\"]","has_code":false}
