{"ID":2922118,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00761","arxiv_id":"2606.00761","title":"Confidence-Adaptive SwiGLU for Mixture-of-Experts","abstract":"SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, $κ$-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate $κ$-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, $κ$-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.","short_abstract":"SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that...","url_abs":"https://arxiv.org/abs/2606.00761","url_pdf":"https://arxiv.org/pdf/2606.00761v1","authors":"[\"Shaohua Li\",\"Xiuchao Sui\",\"Xiaobing Sun\",\"Yuhang Wu\",\"Liangli Zhen\",\"Yong Liu\",\"Rick Siow Mong Goh\"]","published":"2026-05-30T14:58:52Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":612641,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T02:42:49.606572591Z","DeletedAt":null,"paper_id":2922118,"paper_url":"https://arxiv.org/abs/2606.00761","paper_title":"Confidence-Adaptive SwiGLU for Mixture-of-Experts","repo_url":"https://github.com/askerlee/kappa-swiglu","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
