{"ID":2864160,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23661","arxiv_id":"2509.23661","title":"LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training","abstract":"We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.","short_abstract":"We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. \nDifferent from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-qualit...","url_abs":"https://arxiv.org/abs/2509.23661","url_pdf":"https://arxiv.org/pdf/2509.23661v3","authors":"[\"Xiang An\",\"Yin Xie\",\"Kaicheng Yang\",\"Wenkang Zhang\",\"Xiuwei Zhao\",\"Zheng Cheng\",\"Yirui Wang\",\"Songcen Xu\",\"Changrui Chen\",\"Didi Zhu\",\"Chunsheng Wu\",\"Huajie Tan\",\"Chunyuan Li\",\"Jing Yang\",\"Jie Yu\",\"Xiyao Wang\",\"Bin Qin\",\"Yumeng Wang\",\"Zizhen Yan\",\"Ziyong Feng\",\"Ziwei Liu\",\"Bo Li\",\"Jiankang Deng\"]","published":"2025-09-28T05:52:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}
