{"ID":2881897,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15811","arxiv_id":"2508.15811","title":"From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System","abstract":"Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.","short_abstract":"Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and use...","url_abs":"https://arxiv.org/abs/2508.15811","url_pdf":"https://arxiv.org/pdf/2508.15811v2","authors":"[\"Junhao Yin\",\"Haolin Wang\",\"Peng Bao\",\"Ju Xu\",\"Yongliang Wang\"]","published":"2025-08-15T10:17:01Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
