{"ID":2865655,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22964","arxiv_id":"2509.22964","title":"Functional Critics Are Essential for Actor-Critic: From Off-Policy Stability to Efficient Exploration","abstract":"The actor-critic (AC) framework has achieved strong empirical success in off-policy reinforcement learning but suffers from the \"moving target\" problem, where the evaluated policy changes continually. Functional critics, or policy-conditioned value functions, address this by explicitly including a representation of the policy as input. While conceptually appealing, previous efforts have struggled to remain competitive against standard AC. In this work, we revisit functional critics within the actor-critic framework and identify two critical aspects that render them a necessity rather than a luxury. First, we demonstrate their power in stabilizing the complex interplay between the \"deadly triad\" and the \"moving target\". We provide a convergent off-policy AC algorithm under linear functional approximation that dismantles several longstanding barriers between theory and practice: it utilizes target-based TD learning, accommodates dynamic behavior policies, and operates without the restrictive \"full coverage\" assumptions. By formalizing a dual trust-coverage mechanism, our framework provides principled guidelines for pursuing sample efficiency, rigorously governing behavior policy updates and critic re-evaluations to maximize off-policy data utility. Second, we uncover a foundational link between functional critics and efficient exploration. We demonstrate that existing model-free approximations of posterior sampling are limited in capturing policy-dependent uncertainty, a gap the functional critic formalism bridges. These results represent, to our knowledge, first-of-their-kind contributions to the RL literature. 
Practically, we propose a tailored neural network architecture and a minimalist AC algorithm. In preliminary experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods without standard implementation heuristics.","short_abstract":"The actor-critic (AC) framework has achieved strong empirical success in off-policy reinforcement learning but suffers from the \"moving target\" problem, where the evaluated policy changes continually. Functional critics, or policy-conditioned value functions, address this by explicitly including a representation of the...","url_abs":"https://arxiv.org/abs/2509.22964","url_pdf":"https://arxiv.org/pdf/2509.22964v4","authors":"[\"Qinxun Bai\",\"Yuxuan Han\",\"Wei Xu\",\"Zhengyuan Zhou\"]","published":"2025-09-26T21:55:26Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"LoRA\"]","has_code":false}
