{"ID":2848681,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25817","arxiv_id":"2510.25817","title":"A Survey on Efficient Large Language Model Training: From Data-centric Perspectives","abstract":"Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM","short_abstract":"Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. \nT...","url_abs":"https://arxiv.org/abs/2510.25817","url_pdf":"https://arxiv.org/pdf/2510.25817v1","authors":"[\"Junyu Luo\",\"Bohan Wu\",\"Xiao Luo\",\"Zhiping Xiao\",\"Yiqiao Jin\",\"Rong-Cheng Tu\",\"Nan Yin\",\"Yifan Wang\",\"Jingyang Yuan\",\"Wei Ju\",\"Ming Zhang\"]","published":"2025-10-29T17:01:55Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":607637,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2848681,"paper_url":"https://arxiv.org/abs/2510.25817","paper_title":"A Survey on Efficient Large Language Model Training: From Data-centric Perspectives","repo_url":"https://github.com/luo-junyu/Awesome-Data-Efficient-LLM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
