{"ID":2894137,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.12642","arxiv_id":"2507.12642","title":"QSpark: Towards Reliable Qiskit Code Generation","abstract":"Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned the Qwen2.5-Coder-32B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29% Pass@1 ($\\approx+10$ pp over Granite-8B-QK) and GRPO hits 49%, both beating all general-purpose baselines; on the original HumanEval they score 65.90% and 63.00%. GRPO performs well on basic tasks (44/78) and excels on intermediate ones (41/68), but neither GRPO nor ORPO solves any of the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.","short_abstract":"Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned the Qwen2.5-Coder-32B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On t...","url_abs":"https://arxiv.org/abs/2507.12642","url_pdf":"https://arxiv.org/pdf/2507.12642v2","authors":"[\"Kiana Kheiri\",\"Aamna Aamir\",\"Andriy Miranskyy\",\"Chen Ding\"]","published":"2025-07-16T21:27:31Z","proceeding":"cs.SE","tasks":"[\"cs.SE\",\"cs.AI\",\"quant-ph\"]","methods":"[\"Large Language Model\"]","has_code":false}