{"ID":3083811,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T07:23:37.79250861Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06021","arxiv_id":"2606.06021","title":"OPRD: On-Policy Representation Distillation","abstract":"On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.","short_abstract":"On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as...","url_abs":"https://arxiv.org/abs/2606.06021","url_pdf":"https://arxiv.org/pdf/2606.06021v1","authors":"[\"Shenzhi Yang\",\"Guangcheng Zhu\",\"Bowen Song\",\"Haobo Wang\",\"Mingxuan Xia\",\"Xing Zheng\",\"Yingfan Ma\",\"Zhongqi Chen\",\"Weiqiang Wang\",\"Gang Chen\"]","published":"2026-06-04T11:13:01Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[]","has_code":false,"code_links":[{"ID":612837,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_id":3083811,"paper_url":"https://arxiv.org/abs/2606.06021","paper_title":"OPRD: On-Policy Representation Distillation","repo_url":"https://github.com/ShenzhiYang2000/OPRD","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
