{"ID":2836005,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.22481","arxiv_id":"2511.22481","title":"OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency","abstract":"Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 within a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM, where the unified framework reduces TPOT by 36%, and the superimposition of OmniProxy further slashes TTFT by 38%. The project is open-sourced at [gitee.com/omniai/omniinfer](https://gitee.com/omniai/omniinfer).","short_abstract":"Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. 
We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end...","url_abs":"https://arxiv.org/abs/2511.22481","url_pdf":"https://arxiv.org/pdf/2511.22481v1","authors":"[\"Jun Wang\",\"Yunxiang Yao\",\"Wenwei Kuang\",\"Runze Mao\",\"Zhenhao Sun\",\"Zhuang Tao\",\"Ziyang Zhang\",\"Dengyu Li\",\"Jiajun Chen\",\"Zhili Wang\",\"Kai Cui\",\"Congzhi Cai\",\"Longwen Lan\",\"Ken Zhang\"]","published":"2025-11-27T14:13:47Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://gitee.com/omniai/omniinfer\"]","has_code":false}
