{"ID":2854724,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.14828","arxiv_id":"2510.14828","title":"RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning","abstract":"Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.","short_abstract":"Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challen...","url_abs":"https://arxiv.org/abs/2510.14828","url_pdf":"https://arxiv.org/pdf/2510.14828v2","authors":"[\"Jinrui Liu\",\"Bingyan Nie\",\"Boyu Li\",\"Yaran Chen\",\"Yuze Wang\",\"Shunsen He\",\"Haoran Li\"]","published":"2025-10-16T16:04:35Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.RO\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}