{"ID":2868254,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.17238","arxiv_id":"2509.17238","title":"MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE","abstract":"The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.","short_abstract":"The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates...","url_abs":"https://arxiv.org/abs/2509.17238","url_pdf":"https://arxiv.org/pdf/2509.17238v2","authors":"[\"Soheil Zibakhsh\",\"Mohammad Samragh\",\"Kumari Nishu\",\"Lauren Hannah\",\"Arnav Kundu\",\"Minsik Cho\"]","published":"2025-09-21T21:05:29Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
