{"ID":2899178,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.01679","arxiv_id":"2507.01679","title":"Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling","abstract":"Existing LLMs-post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.","short_abstract":"Existing LLMs-post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT...","url_abs":"https://arxiv.org/abs/2507.01679","url_pdf":"https://arxiv.org/pdf/2507.01679v3","authors":"[\"Zeyu Huang\",\"Tianhao Cheng\",\"Zihan Qiu\",\"Zili Wang\",\"Yinghui Xu\",\"Edoardo M. Ponti\",\"Ivan Titov\"]","published":"2025-07-02T13:04:09Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"LoRA\"]","has_code":false}
