{"ID":2895681,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.08513","arxiv_id":"2507.08513","title":"Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation","abstract":"Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images preserving precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.","short_abstract":"Multimodal Large Language Models (MLLMs) struggle with accurately capturing camera-object relations, especially for object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diverse camera-object relations and corresponding textual descriptio...","url_abs":"https://arxiv.org/abs/2507.08513","url_pdf":"https://arxiv.org/pdf/2507.08513v2","authors":"[\"Liu He\",\"Xiao Zeng\",\"Yizhi Song\",\"Albert Y. C. Chen\",\"Lu Xia\",\"Shashwat Verma\",\"Sankalp Dayal\",\"Min Sun\",\"Cheng-Hao Kuo\",\"Daniel Aliaga\"]","published":"2025-07-11T12:00:10Z","proceeding":"cs.GR","tasks":"[\"cs.GR\",\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\"]","has_code":false}
