{"ID":2878964,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17445","arxiv_id":"2508.17445","title":"TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling","abstract":"Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and the efficiency saving of GPU hours from 22\\% up to 43\\% of the sampling design for the trained models, meanwhile showing up to 40\\% reduction at trajectory-level and 35\\% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. 
The project home page is at https://m-a-p.ai/TreePO.","short_abstract":"Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout al...","url_abs":"https://arxiv.org/abs/2508.17445","url_pdf":"https://arxiv.org/pdf/2508.17445v1","authors":"[\"Yizhi Li\",\"Qingshui Gu\",\"Zhoufutu Wen\",\"Ziniu Li\",\"Tianshun Xing\",\"Shuyue Guo\",\"Tianyu Zheng\",\"Xin Zhou\",\"Xingwei Qu\",\"Wangchunshu Zhou\",\"Zheng Zhang\",\"Wei Shen\",\"Qian Liu\",\"Chenghua Lin\",\"Jian Yang\",\"Ge Zhang\",\"Wenhao Huang\"]","published":"2025-08-24T16:52:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Language Model\",\"LoRA\"]","project_urls":"[\"https://m-a-p.ai/TreePO\"]","has_code":false,"code_links":[{"ID":610533,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2878964,"paper_url":"https://arxiv.org/abs/2508.17445","paper_title":"TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling","repo_url":"https://github.com/multimodal-art-projection/TreePO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
