{"ID":2859484,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.06149","arxiv_id":"2510.06149","title":"Implicit Updates for Average-Reward Temporal Difference Learning","abstract":"Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($λ$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($λ$), which employs an implicit fixed point update to provide data-adaptive stabilization while preserving the per iteration computational complexity of standard average-reward TD($λ$). In contrast to prior finite-time analyses of average-reward TD($λ$), which impose restrictive step-size conditions, we establish finite-time error bounds for the implicit variant under substantially weaker step-size requirements. Empirically, average-reward implicit TD($λ$) operates reliably over a much broader range of step-sizes and exhibits markedly improved numerical stability. This enables more efficient policy evaluation and policy learning, highlighting its effectiveness as a robust alternative to average-reward TD($λ$).","short_abstract":"Temporal difference (TD) learning is a cornerstone of reinforcement learning. In the average-reward setting, standard TD($λ$) is highly sensitive to the choice of step-size and thus requires careful tuning to maintain numerical stability. We introduce average-reward implicit TD($λ$), which employs an implicit fixed poi...","url_abs":"https://arxiv.org/abs/2510.06149","url_pdf":"https://arxiv.org/pdf/2510.06149v1","authors":"[\"Hwanwoo Kim\",\"Dongkyu Derek Cho\",\"Eric Laber\"]","published":"2025-10-07T17:19:39Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}