{"ID":2880540,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16646","arxiv_id":"2508.16646","title":"Equinox: Holistic Fair Scheduling in Serving Large Language Models","abstract":"We address the limitations of current LLM serving with a dual-counter framework separating user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Since these metrics are only available post-execution, creating a scheduling paradox, we introduce a deterministic Mixture of Prediction Experts (MoPE) framework to predict user-perceived latency, output tokens, throughput, and GPU utilization. These predictions enable calculation of a unified Holistic Fairness score that balances both counters through tunable parameters for proactive fairness-aware scheduling. We implement this in Equinox, an open-source system with other optimizations like adaptive batching, and stall-free scheduling. Evaluations on production traces (ShareGPT, LMSYS) and synthetic workloads demonstrate Equinox achieves up to $1.3\\times$ higher throughput, 60\\% lower time-to-first-token latency, and 13\\% higher fairness versus VTC while maintaining 94\\% GPU utilization, proving fairness under bounded discrepancy across heterogeneous platforms.","short_abstract":"We address the limitations of current LLM serving with a dual-counter framework separating user and operator perspectives. The User Fairness Counter measures quality of service via weighted tokens and latency; the Resource Fairness Counter measures operational efficiency through throughput and GPU utilization. Since th...","url_abs":"https://arxiv.org/abs/2508.16646","url_pdf":"https://arxiv.org/pdf/2508.16646v1","authors":"[\"Zhixiang Wei\",\"James Yen\",\"Jingyi Chen\",\"Ziyang Zhang\",\"Zhibai Huang\",\"Chen Chen\",\"Xingzi Yu\",\"Yicheng Gu\",\"Chenggang Wu\",\"Yun Wang\",\"Mingyuan Xia\",\"Jie Wu\",\"Hao Wang\",\"Zhengwei Qi\"]","published":"2025-08-19T06:17:17Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}