{"ID":2855276,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.13668","arxiv_id":"2510.13668","title":"STAR: Decode-Phase Rescheduling for LLM Inference","abstract":"Large Language Model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, which often results in SLO violations and OOM failures under evolving decode workloads. In this paper, we propose STAR, a decode rescheduling system powered by length prediction to anticipate future workloads. Our core contributions include: (1) A lightweight and continuous LLM-native prediction method that leverages LLM hidden state to model remaining generation length with high precision (reducing MAE by 49.42%) and low overhead (cutting predictor parameters by 93.28%); (2) A rescheduling solution in decode phase with a dynamic balancing mechanism that integrates current and predicted workloads, reducing P99 TPOT by 75.1% and achieving 2.63 times higher goodput.","short_abstract":"Large Language Model (LLM) inference has emerged as a fundamental paradigm; however, variations in output length cause severe workload imbalance in the decode phase, particularly for long-output reasoning tasks. 
Existing systems, such as PD disaggregation architectures, rely on static prefill-to-decode scheduling, whic...","url_abs":"https://arxiv.org/abs/2510.13668","url_pdf":"https://arxiv.org/pdf/2510.13668v2","authors":"[\"Zhibin Wang\",\"Zetao Hong\",\"Xue Li\",\"Zibo Wang\",\"Shipeng Li\",\"Qingkai Meng\",\"Qing Wang\",\"Chengying Huan\",\"Rong Gu\",\"Sheng Zhong\",\"Chen Tian\"]","published":"2025-10-15T15:29:08Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
