{"ID":2863813,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25100","arxiv_id":"2509.25100","title":"ORPO-Distill: Mixed-Policy Preference Optimization for Cross-Architecture LLM Distillation","abstract":"We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that contrasts teacher and student traces for more effective learning, and adopts a mixed-policy strategy for utilizing student-generated outputs, outperforming both off- and on-policy alternatives. Experiments on five datasets and multiple student models show consistent improvements over conventional black-box KD baselines.","short_abstract":"We introduce ORPO-Distill, a general-purpose method for cross-architecture LLM distillation that formulates the problem as a preference optimization task. Unlike standard CoT distillation, the approach transfers knowledge through diverse reasoning traces. It employs an Odds-Ratio Preference Optimization objective that...","url_abs":"https://arxiv.org/abs/2509.25100","url_pdf":"https://arxiv.org/pdf/2509.25100v1","authors":"[\"Aasheesh Singh\",\"Vishal Vaddina\",\"Dagnachew Birru\"]","published":"2025-09-29T17:34:02Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}
