{"ID":2886629,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.01977","arxiv_id":"2508.01977","title":"TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models","abstract":"To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking has demonstrated strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available: https://github.com/Vicentvankor/sun-shine.","short_abstract":"To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, the large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible...","url_abs":"https://arxiv.org/abs/2508.01977","url_pdf":"https://arxiv.org/pdf/2508.01977v2","authors":"[\"Fan Gao\",\"Cheng Huang\",\"Nyima Tashi\",\"Yutong Liu\",\"Xiangxiang Wang\",\"Thupten Tsering\",\"Ban Ma-bao\",\"Renzeg Duojie\",\"Gadeng Luosang\",\"Rinchen Dongrub\",\"Dorje Tashi\",\"Xiao Feng\",\"Hao Wang\",\"Yongbin Yu\"]","published":"2025-08-04T01:32:58Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":611326,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2886629,"paper_url":"https://arxiv.org/abs/2508.01977","paper_title":"TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models","repo_url":"https://github.com/Vicentvankor/sun-shine","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
