{"ID":2887088,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.02668","arxiv_id":"2508.02668","title":"LOST: Low-rank and Sparse Pre-training for Large Language Models","abstract":"While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training cost. In this context, sparsity is often employed as a complementary technique to recover important information lost in low-rank compression by capturing salient features in the residual space. However, existing approaches typically combine low-rank and sparse components in a simplistic or ad hoc manner, often resulting in undesirable performance degradation compared to full-rank training. In this paper, we propose \\textbf{LO}w-rank and \\textbf{S}parse pre-\\textbf{T}raining (\\textbf{LOST}) for LLMs, a novel method that ingeniously integrates low-rank and sparse structures to enable effective training of LLMs from scratch under strict efficiency constraints. LOST applies singular value decomposition to weight matrices, preserving the dominant low-rank components, while allocating the remaining singular values to construct channel-wise sparse components to complement the expressiveness of low-rank training. We evaluate LOST on LLM pretraining ranging from 60M to 7B parameters. Our experiments show that LOST achieves competitive or superior performance compared to full-rank models, while significantly reducing both memory and compute overhead. Moreover, Code is available at \\href{https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models}{LOST Repo}","short_abstract":"While large language models (LLMs) have achieved remarkable performance across a wide range of tasks, their massive scale incurs prohibitive computational and memory costs for pre-training from scratch. Recent studies have investigated the use of low-rank parameterization as a means of reducing model size and training...","url_abs":"https://arxiv.org/abs/2508.02668","url_pdf":"https://arxiv.org/pdf/2508.02668v1","authors":"[\"Jiaxi Li\",\"Lu Yin\",\"Li Shen\",\"Jinjin Xu\",\"Liwu Xu\",\"Tianjin Huang\",\"Wenwu Wang\",\"Shiwei Liu\",\"Xilu Wang\"]","published":"2025-08-04T17:58:22Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611393,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2887088,"paper_url":"https://arxiv.org/abs/2508.02668","paper_title":"LOST: Low-rank and Sparse Pre-training for Large Language Models","repo_url":"https://github.com/JiaxiLi1/LOST-Low-rank-and-Sparse-Training-for-Large-Language-Models","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}