{"ID":2893796,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.13390","arxiv_id":"2507.13390","title":"PARAM-1 BharatGen 2.9B Model","abstract":"Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where over 20 official languages and 100+ dialects coexist alongside phenomena like code-switching and diglossia. We introduce PARAM-1, a 2.9B parameter decoder-only, text-only language model trained from scratch with an explicit architectural and linguistic focus on Indian diversity. PARAM-1 is trained on a bilingual dataset consisting of only Hindi and English, constructed with a strong focus on fact-rich, high-quality content. It is guided by three core principles: equitable representation of Indic languages through a 25% corpus allocation; tokenization fairness via a SentencePiece tokenizer adapted to Indian morphological structures; and culturally aligned evaluation benchmarks across IndicQA, code-mixed reasoning, and socio-linguistic robustness tasks. By embedding diversity at the pretraining level, rather than deferring it to post-hoc alignment, PARAM-1 offers a design-first blueprint for equitable foundation modeling. Our results demonstrate that it serves as both a competent general-purpose model and a robust baseline for India-centric applications.","short_abstract":"Large Language Models (LLMs) have emerged as powerful general-purpose reasoning systems, yet their development remains dominated by English-centric data, architectures, and optimization paradigms. This exclusionary design results in structural under-representation of linguistically diverse regions such as India, where...","url_abs":"https://arxiv.org/abs/2507.13390","url_pdf":"https://arxiv.org/pdf/2507.13390v1","authors":"[\"Kundeshwar Pundalik\",\"Piyush Sawarkar\",\"Nihar Sahoo\",\"Abhishek Shinde\",\"Prateek Chanda\",\"Vedant Goswami\",\"Ajay Nagpal\",\"Atul Singh\",\"Viraj Thakur\",\"Vijay Dewane\",\"Aamod Thakur\",\"Bhargav Patel\",\"Smita Gautam\",\"Bhagwan Panditi\",\"Shyam Pawar\",\"Madhav Kotcha\",\"Suraj Racha\",\"Saral Sureka\",\"Pankaj Singh\",\"Rishi Bal\",\"Rohit Saluja\",\"Ganesh Ramakrishnan\"]","published":"2025-07-16T06:14:33Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
