{"ID":2896430,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.06567","arxiv_id":"2507.06567","title":"SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference","abstract":"Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage/memory burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K\\geq1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.","short_abstract":"Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage/memory burden for an edge device. To address this challenge, we consider...","url_abs":"https://arxiv.org/abs/2507.06567","url_pdf":"https://arxiv.org/pdf/2507.06567v3","authors":"[\"Qian Chen\",\"Xianhao Chen\",\"Kaibin Huang\"]","published":"2025-07-09T05:43:43Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.DC\",\"cs.NI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}