{"ID":3049983,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-06T14:39:32.180964103Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04968","arxiv_id":"2606.04968","title":"Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement","abstract":"Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience-successful demonstrations, partial completions, recoverable mistakes, and failures-that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.","short_abstract":"Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience-successful demonstrations, partial completions, recoverable mistakes, and failures-that is difficult to use with standard imitation. Full behavior cl...","url_abs":"https://arxiv.org/abs/2606.04968","url_pdf":"https://arxiv.org/pdf/2606.04968v1","authors":"[\"Yunpeng Mei\",\"Jiakai He\",\"Hongjie Cao\",\"Chenyu Wang\",\"Xiaowen Zhu\",\"Yihan Zhou\",\"Jiamin Wang\",\"Chenbo Xin\",\"Peng Cheng\",\"Yuxuan Yang\",\"Yijie Wang\",\"Xinhu Zheng\",\"Gao Huang\",\"Jie Chen\",\"Gang Wang\"]","published":"2026-06-03T14:49:35Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}
