{"ID":2850939,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.20150","arxiv_id":"2510.20150","title":"Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning","abstract":"Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.","short_abstract":"Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats,...","url_abs":"https://arxiv.org/abs/2510.20150","url_pdf":"https://arxiv.org/pdf/2510.20150v5","authors":"[\"Yaochen Zhu\",\"Harald Steck\",\"Dawen Liang\",\"Yinhan He\",\"Vito Ostuni\",\"Jundong Li\",\"Nathan Kallus\"]","published":"2025-10-23T02:56:00Z","proceeding":"cs.IR","tasks":"[\"cs.IR\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607849,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2850939,"paper_url":"https://arxiv.org/abs/2510.20150","paper_title":"Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning","repo_url":"https://github.com/yaochenzhu/Rank-GRPO","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
