{"ID":2895397,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09205","arxiv_id":"2507.09205","title":"From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan","abstract":"Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.","short_abstract":"Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this w...","url_abs":"https://arxiv.org/abs/2507.09205","url_pdf":"https://arxiv.org/pdf/2507.09205v5","authors":"[\"Lei Yang\",\"Leiyu Pan\",\"Bojian Xiong\",\"Renren Jin\",\"Shaowei Zhang\",\"Yue Chen\",\"Ling Shi\",\"Jiang Zhou\",\"Junru Wu\",\"Zhen Wang\",\"Jianxiang Peng\",\"Juesi Xiao\",\"Tianyu Dong\",\"Zhuowen Han\",\"Zhuo Chen\",\"Yuqi Ren\",\"Deyi Xiong\"]","published":"2025-07-12T08:54:05Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}