{"ID":2867908,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18314","arxiv_id":"2509.18314","title":"Exploiting Tree Structure for Credit Assignment in RL Training of LLMs","abstract":"Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce \\textbf{Prefix-to-Tree (P2T)}, a simple procedure that converts a group of responses into a prefix tree and computes \\emph{nonparametric} prefix values \\(V(s)\\) by aggregating descendant outcomes. Built on P2T, we propose \\textbf{TEMPO} (\\emph{\\textbf{T}ree-\\textbf{E}stimated \\textbf{M}ean Prefix Value for \\textbf{P}olicy \\textbf{O}ptimization}), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \\emph{branch-gated} temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.","short_abstract":"Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align wi...","url_abs":"https://arxiv.org/abs/2509.18314","url_pdf":"https://arxiv.org/pdf/2509.18314v2","authors":"[\"Hieu Tran\",\"Zonghai Yao\",\"Hong Yu\"]","published":"2025-09-22T18:37:24Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\"]","has_code":false}
