{"ID":2967398,"CreatedAt":"2026-06-02T14:42:14.75819314Z","UpdatedAt":"2026-06-02T15:31:00.366878821Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2605.29343","arxiv_id":"2605.29343","title":"Draft-OPD: On-Policy Distillation for Speculative Draft Models","abstract":"Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over $5\\times$ lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23\\% and 13\\%, respectively.","short_abstract":"Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SF...","url_abs":"https://arxiv.org/abs/2605.29343","url_pdf":"https://arxiv.org/pdf/2605.29343v2","authors":"[\"Haodi Lei\",\"Yafu Li\",\"Haoran Zhang\",\"Shunkai Zhang\",\"Qianjia Cheng\",\"Xiaoye Qu\",\"Ganqu Cui\",\"Bowen Zhou\",\"Ning Ding\",\"Yun Luo\",\"Yu Cheng\"]","published":"2026-05-28T04:30:22Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
