{"ID":2895798,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.08761","arxiv_id":"2507.08761","title":"Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data","abstract":"Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.","short_abstract":"Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, whi...","url_abs":"https://arxiv.org/abs/2507.08761","url_pdf":"https://arxiv.org/pdf/2507.08761v2","authors":"[\"Jeonghye Kim\",\"Yongjae Shin\",\"Whiyoung Jung\",\"Sunghoon Hong\",\"Deunsol Yoon\",\"Youngchul Sung\",\"Kanghoon Lee\",\"Woohyung Lim\"]","published":"2025-07-11T17:16:02Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}