{"ID":2865503,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.22576","arxiv_id":"2509.22576","title":"EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning","abstract":"Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.","short_abstract":"Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. T...","url_abs":"https://arxiv.org/abs/2509.22576","url_pdf":"https://arxiv.org/pdf/2509.22576v2","authors":"[\"Wujiang Xu\",\"Wentian Zhao\",\"Zhenting Wang\",\"Yu-Jhe Li\",\"Can Jin\",\"Mingyu Jin\",\"Kai Mei\",\"Kun Wan\",\"Dimitris N. Metaxas\"]","published":"2025-09-26T16:51:44Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"LoRA\"]","has_code":false}
