{"ID":2844400,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2602.19006","arxiv_id":"2602.19006","title":"Evaluating Large Language Models on Quantum Mechanics: A Comparative Study Across Diverse Models and Tasks","abstract":"We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical computation, comprising 900 baseline and 75 tool-augmented assessments. Results reveal clear tier stratification: flagship models achieve 81\\% average accuracy, outperforming mid-tier (77\\%) and fast models (67\\%) by 4pp and 14pp respectively. Task difficulty patterns emerge distinctly: derivations show highest performance (92\\% average, 100\\% for flagship models), while numerical computation remains most challenging (42\\%). Tool augmentation on numerical tasks yields task-dependent effects: modest overall improvement (+4.4pp) at 3x token cost masks dramatic heterogeneity ranging from +29pp gains to -16pp degradation. Reproducibility analysis across three runs quantifies 6.3pp average variance, with flagship models demonstrating exceptional stability (GPT-5 achieves zero variance) while specialized models require multi-run evaluation. This work contributes: (i) a benchmark for quantum mechanics with automatic verification, (ii) systematic evaluation quantifying tier-based performance hierarchies, (iii) empirical analysis of tool augmentation trade-offs, and (iv) reproducibility characterization. All tasks, verifiers, and results are publicly released.","short_abstract":"We present a systematic evaluation of large language models on quantum mechanics problem-solving. Our study evaluates 15 models from five providers (OpenAI, Anthropic, Google, Alibaba, DeepSeek) spanning three capability tiers on 20 tasks covering derivations, creative problems, non-standard concepts, and numerical com...","url_abs":"https://arxiv.org/abs/2602.19006","url_pdf":"https://arxiv.org/pdf/2602.19006v1","authors":"[\"S. K. Rithvik\"]","published":"2025-11-09T15:39:24Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"quant-ph\"]","methods":"[\"Language Model\"]","has_code":false}
