{"ID":2823874,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23927","arxiv_id":"2512.23927","title":"Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration","abstract":"Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identify the stability mechanism that replaces it: local stationary norm alignment. Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy. This operator contracts in the policy's stationary state-action norm, whereas standard fitted regression projects Bellman targets in the behavior norm. This mismatch explains instability under distribution shift. We use this insight to develop stationary-reweighted soft FQI, which reweights each regression step toward the stationary distribution of the current softmax policy. Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. Our results also show that ordinary soft FQI is locally stable under on-policy stationary sampling, even without Bellman completeness, and explain temperature annealing as a continuation strategy for reaching a contraction region.","short_abstract":"Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identi...","url_abs":"https://arxiv.org/abs/2512.23927","url_pdf":"https://arxiv.org/pdf/2512.23927v2","authors":"[\"Lars van der Laan\",\"Nathan Kallus\"]","published":"2025-12-30T00:58:35Z","proceeding":"stat.ML","tasks":"[\"stat.ML\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\"]","has_code":false}
