{"ID":2877624,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.19831","arxiv_id":"2508.19831","title":"Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis","abstract":"Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench-Hi, GSM8K-Hi, ChatRAG-Hi, and BFCL-Hi. These were created using a methodology that combines from-scratch human annotation with a translate-and-verify process. We leverage this suite to conduct an extensive benchmarking of open-source LLMs supporting Hindi, providing a detailed comparative analysis of their current capabilities. Our curation process also serves as a replicable methodology for developing benchmarks in other low-resource languages.","short_abstract":"Evaluating instruction-tuned Large Language Models (LLMs) in Hindi is challenging due to a lack of high-quality benchmarks, as direct translation of English datasets fails to capture crucial linguistic and cultural nuances. To address this, we introduce a suite of five Hindi LLM evaluation datasets: IFEval-Hi, MT-Bench...","url_abs":"https://arxiv.org/abs/2508.19831","url_pdf":"https://arxiv.org/pdf/2508.19831v2","authors":"[\"Anusha Kamath\",\"Kanishk Singla\",\"Rakesh Paul\",\"Raviraj Joshi\",\"Utkarsh Vaidya\",\"Sanjay Singh Chauhan\",\"Niranjan Wartikar\"]","published":"2025-08-27T12:35:31Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}