{"ID":2863834,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25137","arxiv_id":"2509.25137","title":"The Era of Real-World Human Interaction: RL from User Conversations","abstract":"We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.","short_abstract":"We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a par...","url_abs":"https://arxiv.org/abs/2509.25137","url_pdf":"https://arxiv.org/pdf/2509.25137v1","authors":"[\"Chuanyang Jin\",\"Jing Xu\",\"Bo Liu\",\"Leitian Tao\",\"Olga Golovneva\",\"Tianmin Shu\",\"Wenting Zhao\",\"Xian Li\",\"Jason Weston\"]","published":"2025-09-29T17:50:31Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Reinforcement Learning\",\"Generative Adversarial Network\"]","has_code":false}
