{"ID":2868342,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.16543","arxiv_id":"2509.16543","title":"ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions","abstract":"Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning and distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.","short_abstract":"Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical informat...","url_abs":"https://arxiv.org/abs/2509.16543","url_pdf":"https://arxiv.org/pdf/2509.16543v1","authors":"[\"Yue Huang\",\"Zhengzhe Jiang\",\"Xiaonan Luo\",\"Kehan Guo\",\"Haomin Zhuang\",\"Yujun Zhou\",\"Zhengqing Yuan\",\"Xiaoqi Sun\",\"Jules Schleinitz\",\"Yanbo Wang\",\"Shuhao Zhang\",\"Mihir Surve\",\"Nitesh V Chawla\",\"Olaf Wiest\",\"Xiangliang Zhang\"]","published":"2025-09-20T05:43:58Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
