{"ID":2863278,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24261","arxiv_id":"2509.24261","title":"Risk-Sensitive RL for Alleviating Exploration Dilemmas in Large Language Models","abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of solutions, boosting single-solution accuracy (pass@1) but suppressing solution diversity and multi-solution performance (pass@k). As a result, RLVR often distills existing capabilities rather than discovering new reasoning strategies. To overcome this, we introduce a Risk-Sensitive Reinforcement Learning framework. Our approach employs a risk-seeking objective that interpolates between mean and maximum rewards, leading to a novel algorithm, Risk-Sensitive GRPO (RS-GRPO), which drives deeper exploration by amplifying learning from challenging prompts. Remarkably, RS-GRPO is simple to implement, requiring only minor code modifications. On six mathematical reasoning benchmarks and with five different LLMs, RS-GRPO consistently improves pass@k performance while maintaining or enhancing pass@1 accuracy.","short_abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing Large Language Models (LLMs) on complex reasoning tasks. However, existing methods suffer from an exploration dilemma: the sharply peaked initial policies of pre-trained LLMs confine standard RL algorithms to a narrow set of soluti...","url_abs":"https://arxiv.org/abs/2509.24261","url_pdf":"https://arxiv.org/pdf/2509.24261v1","authors":"[\"Yuhua Jiang\",\"Jiawei Huang\",\"Yufeng Yuan\",\"Xin Mao\",\"Yu Yue\",\"Qianchuan Zhao\",\"Lin Yan\"]","published":"2025-09-29T04:12:20Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}
