{"ID":2851608,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19366","arxiv_id":"2510.19366","title":"MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs","abstract":"Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a \"quality cliff\", offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an \\emph{Offline Refactoring Engine} systematically deconstructs monolithic experts into fine-grained \"sub-experts.\" This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an \\emph{Online Scheduling Engine} leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9\\% under a strict latency budget or reduce latency by up to 10.36\\% under limited resources. 
MoE-Prism provides the critical \"control knob\" to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.","short_abstract":"Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a \"quality cliff\", offering only a few coarse-grained operating points. This inflexibility fo...","url_abs":"https://arxiv.org/abs/2510.19366","url_pdf":"https://arxiv.org/pdf/2510.19366v1","authors":"[\"Xinfeng Xia\",\"Jiacheng Liu\",\"Xiaofeng Hou\",\"Peng Tang\",\"Mingxuan Zhang\",\"Wenfeng Wang\",\"Chao Li\"]","published":"2025-10-22T08:40:01Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[]","has_code":false}
