{"ID":2895368,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09155","arxiv_id":"2507.09155","title":"OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering","abstract":"We introduce OPENXRD, a comprehensive benchmarking framework for evaluating large language models (LLMs) and multimodal LLMs (MLLMs) in crystallography question answering. The framework measures context assimilation, or how models use fixed, domain-specific supporting information during inference. The framework includes 217 expert-curated X-ray diffraction (XRD) questions covering fundamental to advanced crystallographic concepts, each evaluated under closed-book (without context) and open-book (with context) conditions, where the latter includes concise reference passages generated by GPT-4.5 and refined by crystallography experts. We benchmark 74 state-of-the-art LLMs and MLLMs, including GPT-4, GPT-5, O-series, LLaVA, LLaMA, QWEN, Mistral, and Gemini families, to quantify how different architectures and scales assimilate external knowledge. Results show that mid-sized models (7B--70B parameters) gain the most from contextual materials, while very large models often show saturation or interference and the largest relative gains appear in small and mid-sized models. Expert-reviewed materials provide significantly higher improvements than AI-generated ones even when token counts are matched, confirming that content quality, not quantity, drives performance. OPENXRD offers a reproducible diagnostic benchmark for assessing reasoning, knowledge integration, and guidance sensitivity in scientific domains, and provides a foundation for future multimodal and retrieval-augmented crystallography systems.","short_abstract":"We introduce OPENXRD, a comprehensive benchmarking framework for evaluating large language models (LLMs) and multimodal LLMs (MLLMs) in crystallography question answering. The framework measures context assimilation, or how models use fixed, domain-specific supporting information during inference. The framework include...","url_abs":"https://arxiv.org/abs/2507.09155","url_pdf":"https://arxiv.org/pdf/2507.09155v2","authors":"[\"Ali Vosoughi\",\"Ayoub Shahnazari\",\"Yufeng Xi\",\"Zeliang Zhang\",\"Griffin Hess\",\"Chenliang Xu\",\"Niaz Abdolrahim\"]","published":"2025-07-12T06:25:22Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
