{"ID":2850496,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.21184","arxiv_id":"2510.21184","title":"Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference","abstract":"Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs' probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.","short_abstract":"Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired o...","url_abs":"https://arxiv.org/abs/2510.21184","url_pdf":"https://arxiv.org/pdf/2510.21184v1","authors":"[\"Stephen Zhao\",\"Aidan Li\",\"Rob Brekelmans\",\"Roger Grosse\"]","published":"2025-10-24T06:23:55Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\",\"stat.ML\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
