{"ID":2877900,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.18672","arxiv_id":"2508.18672","title":"Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks","abstract":"Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-$k$ routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.","short_abstract":"Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlo...","url_abs":"https://arxiv.org/abs/2508.18672","url_pdf":"https://arxiv.org/pdf/2508.18672v3","authors":"[\"Taishi Nakamura\",\"Satoki Ishikawa\",\"Masaki Kawamura\",\"Takumi Okamoto\",\"Daisuke Nohara\",\"Jun Suzuki\",\"Rio Yokota\"]","published":"2025-08-26T04:31:28Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":610425,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2877900,"paper_url":"https://arxiv.org/abs/2508.18672","paper_title":"Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks","repo_url":"https://github.com/rioyokotalab/optimal-sparsity","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
