{"ID":2861555,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02148","arxiv_id":"2510.02148","title":"Policy Gradient Guidance Enables Test Time Control","abstract":"We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates behavior without retraining. We provide a theoretical derivation showing that the additional normalization term vanishes under advantage estimation, leading to a clean guided policy gradient update. Empirically, we evaluate PGG on discrete and continuous control benchmarks. We find that conditioning dropout-central to diffusion guidance-offers gains in simple discrete tasks and low sample regimes, but dropout destabilizes continuous control. Training with modestly larger guidance ($γ\u003e1$) consistently improves stability, sample efficiency, and controllability. Our results show that guidance, previously confined to diffusion policies, can be adapted to standard on-policy methods, opening new directions for controllable online reinforcement learning.","short_abstract":"We introduce Policy Gradient Guidance (PGG), a simple extension of classifier-free guidance from diffusion models to classical policy gradient methods. PGG augments the policy gradient with an unconditional branch and interpolates conditional and unconditional branches, yielding a test-time control knob that modulates...","url_abs":"https://arxiv.org/abs/2510.02148","url_pdf":"https://arxiv.org/pdf/2510.02148v1","authors":"[\"Jianing Qi\",\"Hao Tang\",\"Zhigang Zhu\"]","published":"2025-10-02T16:00:35Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\"]","has_code":false}
