{"ID":2886080,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.03018","arxiv_id":"2508.03018","title":"Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning","abstract":"Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.","short_abstract":"Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-rew...","url_abs":"https://arxiv.org/abs/2508.03018","url_pdf":"https://arxiv.org/pdf/2508.03018v2","authors":"[\"Yutong Wang\",\"Pengliang Ji\",\"Kaixin Li\",\"Baolong Bi\",\"Tao Feng\",\"Guillaume Sartoretti\"]","published":"2025-08-05T02:56:58Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.RO\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}