{"ID":2829837,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.11529","arxiv_id":"2512.11529","title":"xGR: Efficient Generative Recommendation Serving at Scale","abstract":"Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x throughput compared to the state-of-the-art baseline under strict latency constraints.","short_abstract":"Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typica...","url_abs":"https://arxiv.org/abs/2512.11529","url_pdf":"https://arxiv.org/pdf/2512.11529v2","authors":"[\"Qingxiao Sun\",\"Tongxuan Liu\",\"Shen Zhang\",\"Siyu Wu\",\"Peijun Yang\",\"Haotian Liang\",\"Menxin Li\",\"Xiaolong Ma\",\"Zhiwei Liang\",\"Ziyi Ren\",\"Minchao Zhang\",\"Xinyu Liu\",\"Ke Zhang\",\"Depei Qian\",\"Hailong Yang\"]","published":"2025-12-12T12:59:38Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\"]","has_code":false}
