{"ID":2894101,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.12566","arxiv_id":"2507.12566","title":"Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models","abstract":"This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.","short_abstract":"This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our...","url_abs":"https://arxiv.org/abs/2507.12566","url_pdf":"https://arxiv.org/pdf/2507.12566v1","authors":"[\"Gen Luo\",\"Wenhan Dou\",\"Wenhao Li\",\"Zhaokai Wang\",\"Xue Yang\",\"Changyao Tian\",\"Hao Li\",\"Weiyun Wang\",\"Wenhai Wang\",\"Xizhou Zhu\",\"Yu Qiao\",\"Jifeng Dai\"]","published":"2025-07-16T18:31:23Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\",\"Generative Adversarial Network\"]","has_code":false,"code_links":[{"ID":612094,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2894101,"paper_url":"https://arxiv.org/abs/2507.12566","paper_title":"Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models","repo_url":"https://github.com/OpenGVLab/Mono-InternVL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
