{"ID":2828288,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13996","arxiv_id":"2512.13996","title":"DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training","abstract":"Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-$p$ implementations with fixed global probability thresholds provide only marginal gains over Top-$k$, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose **DTop-$p$**, a sparsity-controllable dynamic routing mechanism that learns the Top-$p$ probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that **DTop-$p$** consistently outperforms both Top-$k$ and fixed Top-$p$ baselines while matching the average FLOPs of Top-$k$ MoE. Our analysis confirms that **DTop-$p$** exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.","short_abstract":"Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-$p$ routing is more adaptive because it selects experts unti...","url_abs":"https://arxiv.org/abs/2512.13996","url_pdf":"https://arxiv.org/pdf/2512.13996v2","authors":"[\"Can Jin\",\"Hongwu Peng\",\"Mingcan Xiang\",\"Qixin Zhang\",\"Xiangchi Yuan\",\"Amit Hasan\",\"Ohiremen Dibua\",\"Yifan Gong\",\"Yan Kang\",\"Dimitris N. Metaxas\"]","published":"2025-12-16T01:28:57Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Language Model\"]","has_code":false}