{"ID":2823033,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.01088","arxiv_id":"2601.01088","title":"600k-ks-ocr: a large-scale synthetic dataset for optical character recognition in kashmiri script","abstract":"This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endangered Dardic language utilizing a modified Perso-Arabic writing system spoken by approximately seven million people. Each image is rendered at 256x64 pixels with corresponding ground-truth transcriptions provided in multiple formats compatible with CRNN, TrOCR, and generalpurpose machine learning pipelines. The generation methodology incorporates three traditional Kashmiri typefaces, comprehensive data augmentation simulating real-world document degradation, and diverse background textures to enhance model robustness. The dataset is distributed across ten partitioned archives totaling approximately 10.6 GB and is released under the CC-BY-4.0 license to facilitate research in low-resource language optical character recognition.","short_abstract":"This technical report presents the 600K-KS-OCR Dataset, a large-scale synthetic corpus comprising approximately 602,000 word-level segmented images designed for training and evaluating optical character recognition systems targeting Kashmiri script. The dataset addresses a critical resource gap for Kashmiri, an endange...","url_abs":"https://arxiv.org/abs/2601.01088","url_pdf":"https://arxiv.org/pdf/2601.01088v1","authors":"[\"Haq Nawaz Malik\"]","published":"2026-01-03T06:29:17Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[]","has_code":false}
