{"ID":2890610,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19468","arxiv_id":"2507.19468","title":"Back to the Features: DINO as a Foundation for Video World Models","abstract":"We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.","short_abstract":"We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and i...","url_abs":"https://arxiv.org/abs/2507.19468","url_pdf":"https://arxiv.org/pdf/2507.19468v1","authors":"[\"Federico Baldassarre\",\"Marc Szafraniec\",\"Basile Terver\",\"Vasil Khalidov\",\"Francisco Massa\",\"Yann LeCun\",\"Patrick Labatut\",\"Maximilian Seitzer\",\"Piotr Bojanowski\"]","published":"2025-07-25T17:54:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}
