{"ID":2832932,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04673","arxiv_id":"2512.04673","title":"Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models","abstract":"Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoning, and code understanding abilities remains underexplored. In this work, we present a comprehensive evaluation of five general-purpose and three code-specific state-of-the-art LLMs across six diverse benchmarks encompassing linguistic competence, mathematical reasoning, and trustworthiness. Additionally, we analyze model behavior on the CoNaLa dataset for code explanation, comparing natural language and code-specialized LLMs. Our findings reveal that models optimized for code (e.g., CodeLLaMA variants) exhibit strong reasoning and syntactic precision, that even for non-coding tasks can show measurable performance gains, in contrast to general-purpose models like Mistral-7B and Llama-3-8B.","short_abstract":"Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model capabilities, a systematic cross-domain comparison that unifies linguistic, reasoni...","url_abs":"https://arxiv.org/abs/2512.04673","url_pdf":"https://arxiv.org/pdf/2512.04673v1","authors":"[\"Gunjan Das\",\"Paheli Bhattacharya\",\"Rishabh Gupta\"]","published":"2025-12-04T11:06:33Z","proceeding":"cs.SE","tasks":"[\"cs.SE\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
