{"ID":3083764,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T05:16:48.22291569Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06096","arxiv_id":"2606.06096","title":"OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation","abstract":"Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad","short_abstract":"Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objective...","url_abs":"https://arxiv.org/abs/2606.06096","url_pdf":"https://arxiv.org/pdf/2606.06096v1","authors":"[\"Paavo Parmas\",\"Yongmin Kim\",\"Kohsei Matsutani\",\"Shota Takashiro\",\"Soichiro Nishimori\",\"Takeshi Kojima\",\"Yusuke Iwasawa\",\"Yutaka Matsuo\"]","published":"2026-06-04T12:34:15Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":612829,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-05T06:46:15.197025399Z","DeletedAt":null,"paper_id":3083764,"paper_url":"https://arxiv.org/abs/2606.06096","paper_title":"OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation","repo_url":"https://github.com/paavo5/ordergrad","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
