{"ID":2866381,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.19781","arxiv_id":"2509.19781","title":"Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference","abstract":"Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \\textit{online inference} remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during online inference, task information is often unavailable, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \\texttt{Tanbr}, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \\texttt{Tanbr} estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \\texttt{Tanbr} employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides the optimal expert merging. We prove that \\texttt{Tanbr} achieves a sublinear regret bound of {\\small $\\mathcal{O}(\\sqrt{T} \\log(T))$} over {\\small $T$} rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. 
Extensive experiments show that \\texttt{Tanbr} reduces inference latency by at least {\\small $45\\%$} and memory usage by up to {\\small $25\\%$}, while maintaining high accuracy compared to many state-of-the-art methods.","short_abstract":"Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \\textit{online inference} remains challenging due to the large size of a ful...","url_abs":"https://arxiv.org/abs/2509.19781","url_pdf":"https://arxiv.org/pdf/2509.19781v2","authors":"[\"Ziyi Han\",\"Xutong Liu\",\"Ruiting Zhou\",\"Xiangxiang Dai\",\"John C. S. Lui\"]","published":"2025-09-24T06:04:10Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Mixture of Experts\",\"Transformer\"]","has_code":false}
