{"ID":2861042,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03160","arxiv_id":"2510.03160","title":"SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus","abstract":"Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.","short_abstract":"Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. Howev...","url_abs":"https://arxiv.org/abs/2510.03160","url_pdf":"https://arxiv.org/pdf/2510.03160v3","authors":"[\"Ming Zhao\",\"Wenhui Dong\",\"Yang Zhang\",\"Xiang Zheng\",\"Zhonghao Zhang\",\"Zian Zhou\",\"Yunzhi Guan\",\"Liukun Xu\",\"Wei Peng\",\"Zhaoyang Gong\",\"Zhicheng Zhang\",\"Dachuan Li\",\"Xiaosheng Ma\",\"Yuli Ma\",\"Jianing Ni\",\"Changjiang Jiang\",\"Lixia Tian\",\"Qixin Chen\",\"Kaishun Xia\",\"Pingping Liu\",\"Tongshun Zhang\",\"Zhiqiang Liu\",\"Zhongyan Bi\",\"Chenyang Si\",\"Tiansheng Sun\",\"Caifeng Shan\"]","published":"2025-10-03T16:32:02Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}