{"ID":2886995,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02520","arxiv_id":"2508.02520","title":"Huawei Cloud Model-as-a-Service on the CloudMatrix384 SuperPod","abstract":"Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-server SuperPod with 384 Ascend 910C chips connected by a high-bandwidth UB fabric and global shared memory. It serves models including DeepSeek, Kimi, GLM, Qwen, and MiniMax, among others. xDeepServe is built around Transformerless, a disaggregated execution architecture that decomposes transformer inference into modular units -- attention, feedforward, and MoE -- and supports disaggregated Prefill-Decode and MoE-Attention deployments. To enable disaggregation, we develop XCCL, a memory-semantic communication layer providing microsecond-level point-to-point and scalable all-to-all primitives, and we extend FlowServe with decentralized DP groups and techniques to mitigate stragglers and synchronization variance. In a peak decoding configuration, xDeepServe reaches 2400 tokens/s per Ascend 910C chip at ~50ms time-per-output-token (TPOT).","short_abstract":"Scaled-out MoE LLMs and scaled-up SuperPods create new systems challenges for production Model-as-a-Service (MaaS), requiring disaggregation, low-latency communication, and decentralized serving. This report presents xDeepServe, the production serving system behind Huawei Cloud's MaaS offering on CloudMatrix384, a 48-s...","url_abs":"https://arxiv.org/abs/2508.02520","url_pdf":"https://arxiv.org/pdf/2508.02520v6","authors":"[\"Ao Xiao\",\"Bangzheng He\",\"Baoquan Zhang\",\"Baoxing Huai\",\"Bingji Wang\",\"Bo Wang\",\"Bo Xu\",\"Boyi Hou\",\"Chan Yang\",\"Changhong Liu\",\"Cheng Cui\",\"Chenyu Zhu\",\"Cong Feng\",\"Daohui Wang\",\"Dayun Lin\",\"Duo Zhao\",\"Fengshao Zou\",\"Fu Wang\",\"Gangqiang Zhang\",\"Gengyuan Dan\",\"Guanjie Chen\",\"Guodong Guan\",\"Guodong Yang\",\"Haifeng Li\",\"Haipei Zhu\",\"Haley Li\",\"Hao Feng\",\"Hao Huang\",\"Hao Xu\",\"Hengrui Ma\",\"Hengtao Fan\",\"Hui Liu\",\"Jia Li\",\"Jiang Liu\",\"Jiang Xu\",\"Jie Meng\",\"Jinhan Xin\",\"Junhao Hu\",\"Juwei Chen\",\"Lan Yu\",\"Lanxin Miao\",\"Liang Liu\",\"Linan Jing\",\"Lu Zhou\",\"Meina Han\",\"Mingkun Deng\",\"Mingyu Deng\",\"Naitian Deng\",\"Nizhong Lin\",\"Peihan Zhao\",\"Peng Pan\",\"Pengfei Shen\",\"Ping Li\",\"Qi Zhang\",\"Qian Wang\",\"Qin ZhC Qingrong Xia\",\"Qingyi Zhang\",\"Qunchao Fu\",\"Ren Guo\",\"Ruimin Gao\",\"Shaochun Li\",\"Sheng Long\",\"Shentian Li\",\"Shining Wan\",\"Shuai Shen\",\"Shuangfu Zeng\",\"Shuming Jing\",\"Siqi Yang\",\"Song Zhang\",\"Tao Xu\",\"Tianlin Du\",\"Ting Chen\",\"Wanxu Wu\",\"Wei Jiang\",\"Weinan Tong\",\"Weiwei Chen\",\"Wen Peng\",\"Wenli Zhou\",\"Wenquan Yang\",\"Wenxin Liang\",\"Xiang Liu\",\"Xiaoli Zhou\",\"Xin Jin\",\"Xinyu Duan\",\"Xu Li\",\"Xu Zhang\",\"Xusheng Chen\",\"Yalong Shan\",\"Yang Gan\",\"Yao Lu\",\"Yi Deng\",\"Yi Zheng\",\"Ying Xiong\",\"Yingfei Zheng\",\"Yiyun Zheng\",\"Yizhou Shan\",\"Yong Gao\",\"Yong Zhang\",\"Yongqiang Yang\",\"Yuanjin Gong\",\"Yue Yu\",\"Yuetao Chen\",\"Yukun Zhu\",\"Yulong He\",\"Yusu Zhao\",\"Yuyan Wu\",\"Zenan Zhang\",\"Zhaojin Zhuo\",\"Zhaoyang Ji\",\"Zhefeng Wang\",\"Zheng Wang\",\"Zhenan Fan\",\"Zhenhua Yang\",\"Zhenli Sheng\",\"Zhibin Yu\",\"Zhigang Ji\",\"Zhihao Ren\",\"Zhipeng Bian\",\"Zhixia Liu\",\"Zhiyu Dong\",\"Zhonghua Li\",\"Zhou Yu\",\"Zhuoming Shen\",\"Zhuwei Peng\",\"Zi Ye\",\"Zihao Xiang\",\"Zimin Fu\",\"Zixuan Zhang\"]","published":"2025-08-04T15:30:57Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Transformer\",\"Large Language Model\"]","has_code":false}
