{"ID":2862075,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00855","arxiv_id":"2510.00855","title":"Can World Models Benefit VLMs for World Dynamics?","abstract":"Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant conventional vision encoder paradigms for general-purpose multimodal understanding? While recent studies have begun to explore the potential of world models on common vision tasks, these explorations typically lack a systematic investigation of generic, multimodal tasks. In this work, we strive to investigate the capabilities when world model priors are transferred into Vision-Language Models: we re-purpose a video diffusion model as a generative encoder to perform a single denoising step and treat the resulting latents as a set of visual embedding. We empirically investigate this class of models, which we refer to as World-Language Models (WorldLMs), and we find that generative encoders can capture latents useful for downstream understanding that show distinctions from conventional encoders. Naming our best-performing variant Dynamic Vision Aligner (DyVA), we further discover that this method significantly enhances spatial reasoning abilities and enables single-image models to perform multi-frame reasoning. Through the curation of a suite of visual reasoning tasks, we find DyVA to surpass both open-source and proprietary baselines, achieving state-of-the-art or comparable performance. We attribute these gains to WorldLM's inherited motion-consistency internalization from video pre-training. Finally, we systematically explore extensive model designs to highlight promising directions for future work. We hope our study can pave the way for a new family of VLMs that leverage priors from world models and are on a promising path towards generalist vision learners.","short_abstract":"Trained on internet-scale video data, generative world models are increasingly recognized as powerful world simulators that can generate consistent and plausible dynamics over structure, motion, and physics. This raises a natural question: with the advent of strong video foundational models, might they supplant convent...","url_abs":"https://arxiv.org/abs/2510.00855","url_pdf":"https://arxiv.org/pdf/2510.00855v1","authors":"[\"Kevin Zhang\",\"Kuangzhi Ge\",\"Xiaowei Chi\",\"Renrui Zhang\",\"Shaojun Shi\",\"Zhen Dong\",\"Sirui Han\",\"Shanghang Zhang\"]","published":"2025-10-01T13:07:05Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Diffusion Model\",\"Language Model\",\"LoRA\"]","has_code":false}
