{"ID":3050198,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-04T19:30:45.638509697Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04036","arxiv_id":"2606.04036","title":"Self-Distilled Policy Gradient","abstract":"On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, as well as reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.","short_abstract":"On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler diverge...","url_abs":"https://arxiv.org/abs/2606.04036","url_pdf":"https://arxiv.org/pdf/2606.04036v1","authors":"[\"Yifeng Liu\",\"Shiyuan Zhang\",\"Yifan Zhang\",\"Quanquan Gu\"]","published":"2026-06-02T02:31:13Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612789,"CreatedAt":"2026-06-04T02:13:16.786527022Z","UpdatedAt":"2026-06-04T02:13:16.786527022Z","DeletedAt":null,"paper_id":3050198,"paper_url":"https://arxiv.org/abs/2606.04036","paper_title":"Self-Distilled Policy Gradient","repo_url":"https://github.com/lauyikfung/SDPG","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
