{"ID":2865816,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03243","arxiv_id":"2510.03243","title":"Prompt-Aware Scheduling for Low-Latency LLM Serving","abstract":"Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued behind them. In this paper, we introduce PARS, a prompt-aware LLM task scheduler that improves serving efficiency by approximating shortest-job-first (SJF) scheduling through pairwise ranking with margin ranking loss. PARS focuses on impactful scheduling decisions and is seamlessly integrated into the state-of-the-art LLM serving system vLLM. It effectively predicts response-length-based task ordering, reducing latency with minimal overhead. Extensive experiments across multiple LLMs and real-world inference datasets show that PARS significantly improves performance, including for reasoning workloads. Furthermore, our cross-model evaluations demonstrate that the design generalizes well, enabling effective scheduling even when predictors are trained on different LLMs.","short_abstract":"Efficient scheduling of LLM inference tasks is essential for achieving low latency and high throughput, particularly with the growing use of reasoning-capable LLMs. Traditional strategies like First-Come-First-Serve (FCFS) often suffer from Head-of-Line (HOL) blocking, where long-running tasks delay shorter ones queued...","url_abs":"https://arxiv.org/abs/2510.03243","url_pdf":"https://arxiv.org/pdf/2510.03243v2","authors":"[\"Yiheng Tao\",\"Yihe Zhang\",\"Matthew T. Dearing\",\"Xin Wang\",\"Yuping Fan\",\"Zhiling Lan\"]","published":"2025-09-25T07:26:38Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.DC\",\"cs.PF\"]","methods":"[\"Large Language Model\"]","has_code":false}
