{"ID":2862595,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25913","arxiv_id":"2509.25913","title":"Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel","abstract":"Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \\textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\\mathrm{Softmax}$. We demonstrate that this router generalizes both $\\mathrm{Sigmoid}$- and $\\mathrm{Softmax}$-based routers. \\textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\\mathrm{ReLU}$ activation and $\\ell_2$-normalization in $\\mathrm{KERN}$ router function.} Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function \\methodNorm.","short_abstract":"Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded...","url_abs":"https://arxiv.org/abs/2509.25913","url_pdf":"https://arxiv.org/pdf/2509.25913v2","authors":"[\"Chuanyang Zheng\",\"Jiankai Sun\",\"Yihang Gao\",\"Enze Xie\",\"Yuehao Wang\",\"Peihao Wang\",\"Ting Xu\",\"Matthew Chang\",\"Liliang Ren\",\"Jingyao Li\",\"Jing Xiong\",\"Kashif Rasul\",\"Mac Schwager\",\"Anderson Schneider\",\"Zhangyang Wang\",\"Yuriy Nevmyvaka\"]","published":"2025-09-30T08:04:02Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
