{"ID":2854223,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15830","arxiv_id":"2510.15830","title":"SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients","abstract":"The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo's surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-$K$ Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5 - 2.5$\\times$ in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.","short_abstract":"The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory -...","url_abs":"https://arxiv.org/abs/2510.15830","url_pdf":"https://arxiv.org/pdf/2510.15830v1","authors":"[\"Dominik Kallusky\",\"Vinay Rao\",\"Vishal Nandavanam\",\"Hao-Jun Michael Shi\"]","published":"2025-10-17T17:11:45Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
