{"ID":2921675,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T05:56:00.181519634Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01155","arxiv_id":"2606.01155","title":"When Data Is Scarce: Scaling Sparse Language Models with Repeated Training","abstract":"Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held-out dense-equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data-limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3. Resource trade-offs: With fixed data, loss-optimal sparsity is moderate ~ 50%, while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse-dc-scaling.","short_abstract":"Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, spar...","url_abs":"https://arxiv.org/abs/2606.01155","url_pdf":"https://arxiv.org/pdf/2606.01155v1","authors":"[\"Boqian Wu\",\"Qiao Xiao\",\"Patrik Okanovic\",\"Tomasz Sternal\",\"Maurice van Keulen\",\"Mykola Pechenizkiy\",\"Elena Mocanu\",\"Torsten Hoefler\",\"Decebal Constantin Mocanu\"]","published":"2026-05-31T10:51:18Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":612594,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T02:42:49.606572591Z","DeletedAt":null,"paper_id":2921675,"paper_url":"https://arxiv.org/abs/2606.01155","paper_title":"When Data Is Scarce: Scaling Sparse Language Models with Repeated Training","repo_url":"https://github.com/boqian333/sparse-dc-scaling","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
