{"ID":2864165,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23667","arxiv_id":"2509.23667","title":"Why Alignment Must Precede Distillation: A Minimal Working Explanation","abstract":"For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue this common practice introduces a significant limitation by overlooking a key property of the alignment's reference model: its distributional recall. We show that the standard KD -\u003e Align workflow diminishes the model's capacity to align rare yet desirable behaviors, even under strong preference signals. We instead demonstrate that reversing the pipeline (i.e., Align -\u003e KD) is essential: alignment must first be performed on a high-recall reference before distillation. Our contributions are threefold. First, we provide a minimal working explanation of how the reference model constrains preference alignment objectives at a fundamental level. Second, we validate this theory in a controllable Mixture-of-Gaussians experiment, where low-recall anchoring consistently results in suboptimal model performance. Finally, we demonstrate that the same phenomenon holds in LLM alignment with the SmolLM2 family: models aligned after KD fail to effectively align target behaviors, resulting in substantially lower reward and target precision. In contrast, our proposed Align -\u003e KD pipeline robustly aligns these behaviors, yielding models with superior target-oriented metrics and lower variance. Together, these results establish reference-model recall as a first-order design choice in alignment, offering a clear principle: alignment must precede distillation.","short_abstract":"For efficiency, preference alignment is often performed on compact, knowledge-distilled (KD) models. We argue this common practice introduces a significant limitation by overlooking a key property of the alignment's reference model: its distributional recall. We show that the standard KD -\u003e Align workflow diminishes th...","url_abs":"https://arxiv.org/abs/2509.23667","url_pdf":"https://arxiv.org/pdf/2509.23667v1","authors":"[\"Sungmin Cha\",\"Kyunghyun Cho\"]","published":"2025-09-28T06:12:19Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
