{"ID":2869437,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15110","arxiv_id":"2509.15110","title":"TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference","abstract":"Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) for training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain -- and yield higher-quality language model policies in 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.","short_abstract":"Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward...","url_abs":"https://arxiv.org/abs/2509.15110","url_pdf":"https://arxiv.org/pdf/2509.15110v2","authors":"[\"Dan Zhang\",\"Min Cai\",\"Jonathan Light\",\"Ziniu Hu\",\"Yisong Yue\",\"Jie Tang\"]","published":"2025-09-18T16:14:34Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":609685,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2869437,"paper_url":"https://arxiv.org/abs/2509.15110","paper_title":"TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference","repo_url":"https://github.com/THUDM/TDRM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
