{"ID":2878088,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.18983","arxiv_id":"2508.18983","title":"SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution","abstract":"The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.","short_abstract":"The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that...","url_abs":"https://arxiv.org/abs/2508.18983","url_pdf":"https://arxiv.org/pdf/2508.18983v3","authors":"[\"Guoying Zhu\",\"Meng Li\",\"Haipeng Dai\",\"Xuechen Liu\",\"Weijun Wang\",\"Keran Li\",\"Jun xiao\",\"Ligeng Chen\",\"Wei Wang\"]","published":"2025-08-26T12:32:09Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Mixture of Experts\",\"Language Model\"]","has_code":false}
