{"ID":2837199,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07846","arxiv_id":"2512.07846","title":"MixLM: High-Throughput and Effective LLM Ranking via Text-Embedding Mix-Interaction","abstract":"Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking systems often create long-context, prefill-heavy workloads, as the model has to be presented with the user, query and item information. To this end, we propose MixLM, a novel LLM-based ranking framework, which significantly improves the system throughput by reducing the input context length, while preserving the semantic strength of cross-encoder rankers. In contrast to a standard ranking system where the context is presented to the model as pure text, we propose to use mix-interaction, a mixture of text and embedding tokens to represent the input. Specifically, MixLM encodes all items in the catalog into a few embedding tokens and stores them in a nearline cache. The encoded item descriptions are used during online inference, effectively reducing the item length from a few thousand text tokens to a few embedding tokens. We share insights from deploying our MixLM framework to a real-world search application at LinkedIn, including a detailed discussion of our training pipelines, as well as a thorough analysis of our online serving infrastructure optimization. With the same latency budget and on-par relevance metrics, MixLM increased throughput by 10.0x compared with strong baselines and 75.9x over full-text LLM rerankers. 
The efficiency gains delivered by MixLM enabled full-traffic deployment of LLM-powered search, which resulted in a significant 0.47% increase in Daily Active Users (DAU) in online A/B tests.","short_abstract":"Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking syste...","url_abs":"https://arxiv.org/abs/2512.07846","url_pdf":"https://arxiv.org/pdf/2512.07846v2","authors":"[\"Guoyao Li\",\"Ran He\",\"Shusen Jing\",\"Kayhan Behdin\",\"Yubo Wang\",\"Sundara Raman Ramachandran\",\"Chanh Nguyen\",\"Jian Sheng\",\"Xiaojing Ma\",\"Chuanrui Zhu\",\"Sriram Vasudevan\",\"Muchen Wu\",\"Sayan Ghosh\",\"Lin Su\",\"Qingquan Song\",\"Xiaoqing Wang\",\"Zhipeng Wang\",\"Qing Lan\",\"Yanning Chen\",\"Jingwei Wu\",\"Luke Simon\",\"Wenjing Zhang\",\"Qi Guo\",\"Fedor Borisyuk\"]","published":"2025-11-25T21:23:04Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
