{"ID":2840993,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12596","arxiv_id":"2511.12596","title":"Group-Aware Reinforcement Learning for Output Diversity in Large Language Models","abstract":"Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.","short_abstract":"Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. \nWe introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy...","url_abs":"https://arxiv.org/abs/2511.12596","url_pdf":"https://arxiv.org/pdf/2511.12596v1","authors":"[\"Oron Anschel\",\"Alon Shoshan\",\"Adam Botach\",\"Shunit Haviv Hakimi\",\"Asaf Gendler\",\"Emanuel Ben Baruch\",\"Nadav Bhonker\",\"Igor Kviatkovsky\",\"Manoj Aggarwal\",\"Gerard Medioni\"]","published":"2025-11-16T13:42:55Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false}
