{"ID":2871678,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09990","arxiv_id":"2509.09990","title":"CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China","abstract":"Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.","short_abstract":"Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To addre...","url_abs":"https://arxiv.org/abs/2509.09990","url_pdf":"https://arxiv.org/pdf/2509.09990v1","authors":"[\"Guixian Xu\",\"Zeli Su\",\"Ziyin Zhang\",\"Jianing Liu\",\"XU Han\",\"Ting Zhang\",\"Yushuang Dong\"]","published":"2025-09-12T06:18:44Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[]","has_code":false}
