{"ID":2879987,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.15919","arxiv_id":"2508.15919","title":"HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling","abstract":"Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicability in real-world deployments with heterogeneous requests, variable prompt lengths, and elastic scaling requirements. We present HFX, a production LLM serving system that jointly optimizes request scheduling and elastic scaling across model replicas to satisfy diverse SLOs. HFX introduces a \\textbf{scheduler} that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests. HFX also integrates a \\textbf{scaler} that supports fast device-to-device (D2D) weight transfer, reducing cold-start latency. Additionally, the system supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments. Through extensive experiments on multi-task workloads, we demonstrate consistently higher SLO attainment, lower end-to-end latency, and lower NPU usage cost by up to 4.44$\\times$, 65.82\\%, and 49.81\\%, respectively, compared to state-of-the-art systems. Our results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.","short_abstract":"Large language model (LLM) serving faces the dual challenge of meeting strict user-specific service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Existing approaches either rely on static scheduling policies or focus on single-task settings, limiting their applicabilit...","url_abs":"https://arxiv.org/abs/2508.15919","url_pdf":"https://arxiv.org/pdf/2508.15919v3","authors":"[\"Zahra Yousefijamarani\",\"Xinglu Wang\",\"Qian Wang\",\"Morgan Lindsay Heisler\",\"Taha Shabani\",\"Niloofar Gholipour\",\"Parham Yassini\",\"Hong Chang\",\"Kan Chen\",\"Qiantao Zhang\",\"Xiaolong Bai\",\"Jiannan Wang\",\"Ying Xiong\",\"Yong Zhang\",\"Zhenan Fan\"]","published":"2025-08-21T18:40:20Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
