{"ID":2849852,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23868","arxiv_id":"2510.23868","title":"GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA","abstract":"This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying z-score standardization, the intractable partition function $Z(x)$ in the DPO implicit reward is canceled, and the KL coefficient $β$ is eliminated from the RLHF and RLVR objective. The population minimizers of $\\mathcal{L}_{\\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $π^{*}_β(y|x)\\proptoπ_{\\text{ref}}(y|x)e^{\\frac{1}βr_φ(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $β(x)=\\frac{σ_φ(x)}{\\hatσ_θ(x)}$. GIFT therefore solves the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $β$ with a prompt-adaptive $β(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO and GSPO and overfits less on RLVR (GSM8K, MATH, AIME) and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.","short_abstract":"This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying...","url_abs":"https://arxiv.org/abs/2510.23868","url_pdf":"https://arxiv.org/pdf/2510.23868v5","authors":"[\"Zhichao Wang\"]","published":"2025-10-27T21:18:19Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"RLHF\"]","has_code":false}
