{"ID":2879206,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.17137","arxiv_id":"2508.17137","title":"MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices","abstract":"The deployment of large-scale Mixture-of-Experts (MoE) models on edge devices presents significant challenges due to memory constraints. While MoE architectures enable efficient utilization of computational resources by activating only a subset of experts per inference, they require careful memory management to operate efficiently in resource-constrained environments. Traditional heuristic-based expert caching strategies such as MoE-Infinity struggle to maintain high cache hit rates as models parameters scale. In this work, we introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. By framing the task as a multi-label sequence prediction problem, we train a lightweight transformer model on 66 million expert activation traces extracted from LDJnr-Puffin dataset [5] using DeepSeek-V2-Chat-Lite MoE. Our predictor generalizes effectively across unseen prompts from WebGLM-QA dataset [6], achieving 97.5% accuracy and an 86.6% F1-score. Simulation results show that MoE-Beyond improves GPU cache hit rate from 17% to 72% when only 10% of experts fit in GPU cache, outperforming heuristic baselines.","short_abstract":"The deployment of large-scale Mixture-of-Experts (MoE) models on edge devices presents significant challenges due to memory constraints. While MoE architectures enable efficient utilization of computational resources by activating only a subset of experts per inference, they require careful memory management to operate...","url_abs":"https://arxiv.org/abs/2508.17137","url_pdf":"https://arxiv.org/pdf/2508.17137v1","authors":"[\"Nishant Gavhane\",\"Arush Mehrotra\",\"Rohit Chawla\",\"Peter Proenca\"]","published":"2025-08-23T20:28:32Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\"]","has_code":false}
