{"ID":2884637,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06133","arxiv_id":"2508.06133","title":"LLM Serving Optimization with Variable Prefill and Decode Lengths","abstract":"We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arriving together, we schedule mixed prefill and decode batches to minimize total end-to-end latency. We show that heterogeneity in prompt lengths makes the problem computationally intractable and that widely used heuristics such as first-come-first-served and shortest-first can be arbitrarily suboptimal. We propose Sorted-F, which repeatedly forms feasible batches using a new selection metric that balances batch size against downstream decode cost, and prove it achieves a constant-factor guarantee on total latency. We further develop practical variants -- an exact solver for small instances and fast heuristics for larger ones -- and evaluate them on a public workload spanning short conversations and long-document summarization, where they consistently reduce average latency relative to standard baselines. Our results highlight that during peak-hour tidal backlogs, greedy GPU packing or short-request prioritization can perform poorly when prompt lengths vary widely, and provide a principled, tunable framework for designing production batch schedulers and planning capacity in memory-constrained LLM serving systems.","short_abstract":"We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV usage, and each generated token increases memory by one unit. Given a backlog of n requests arrivi...","url_abs":"https://arxiv.org/abs/2508.06133","url_pdf":"https://arxiv.org/pdf/2508.06133v3","authors":"[\"Meixuan Wang\",\"Yinyu Ye\",\"Zijie Zhou\"]","published":"2025-08-08T08:54:21Z","proceeding":"math.OC","tasks":"[\"math.OC\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}