{"ID":2885181,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.05289","arxiv_id":"2508.05289","title":"RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders","abstract":"Conversational recommender systems (CRS) based on Large Language Models (LLMs) need to constantly be aligned to the user preferences to provide satisfying and context-relevant item recommendations. The traditional supervised fine-tuning cannot capture the implicit feedback signal, e.g., dwell time, sentiment polarity, or engagement patterns. In this paper, we share a fine-tuning solution using human feedback reinforcement learning (RLHF) to maximize implied user feedback (IUF) in a multi-turn recommendation context. We specify a reward model $R_φ$ learnt on weakly-labelled engagement information and maximize user-centric utility by optimizing the foundational LLM M_θ through a proximal policy optimization (PPO) approach. The architecture models conversational state transitions $s_t \\to a_t \\to s_{t +1}$, where the action $a_t$ is associated with LLM-generated item suggestions only on condition of conversation history in the past. The evaluation across synthetic and real-world datasets (e.g.REDIAL, OpenDialKG) demonstrates that our RLHF-fine-tuned models can perform better in terms of top-$k$ recommendation accuracy, coherence, and user satisfaction compared to (arrow-zero-cmwrquca-teja-falset ensuite 2Round group-deca States penalty give up This paper shows that implicit signal alignment can be efficient in achieving scalable and user-adaptive design of CRS.","short_abstract":"Conversational recommender systems (CRS) based on Large Language Models (LLMs) need to constantly be aligned to the user preferences to provide satisfying and context-relevant item recommendations. The traditional supervised fine-tuning cannot capture the implicit feedback signal, e.g., dwell time, sentiment polarity,...","url_abs":"https://arxiv.org/abs/2508.05289","url_pdf":"https://arxiv.org/pdf/2508.05289v1","authors":"[\"Zhongheng Yang\",\"Aijia Sun\",\"Yushang Zhao\",\"Yinuo Yang\",\"Dannier Li\",\"Chengrui Zhou\"]","published":"2025-08-07T11:36:55Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\",\"RLHF\"]","has_code":false}
