{"ID":2858341,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08531","arxiv_id":"2510.08531","title":"SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models","abstract":"Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.","short_abstract":"Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the...","url_abs":"https://arxiv.org/abs/2510.08531","url_pdf":"https://arxiv.org/pdf/2510.08531v1","authors":"[\"Hongxing Li\",\"Dingming Li\",\"Zixuan Wang\",\"Yuchen Yan\",\"Hang Wu\",\"Wenqi Zhang\",\"Yongliang Shen\",\"Weiming Lu\",\"Jun Xiao\",\"Yueting Zhuang\"]","published":"2025-10-09T17:50:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Language Model\"]","has_code":false}
