{"ID":2828173,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.15840","arxiv_id":"2512.15840","title":"Large Video Planner Enables Generalizable Robot Control","abstract":"General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.","short_abstract":"General-purpose robots require decision-making models that generalize across diverse tasks and environments. \nRecent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition t...","url_abs":"https://arxiv.org/abs/2512.15840","url_pdf":"https://arxiv.org/pdf/2512.15840v2","authors":"[\"Boyuan Chen\",\"Tianyuan Zhang\",\"Haoran Geng\",\"Caiyi Zhang\",\"Peihao Li\",\"Kiwhan Song\",\"William T. Freeman\",\"Jitendra Malik\",\"Pieter Abbeel\",\"Russ Tedrake\",\"Vincent Sitzmann\",\"Yilun Du\"]","published":"2025-12-17T18:35:54Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","project_urls":"[\"https://www.boyuan.space/large-video-planner/\"]","has_code":false,"code_links":[{"ID":605855,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828173,"paper_url":"https://arxiv.org/abs/2512.15840","paper_title":"Large Video Planner Enables Generalizable Robot Control","repo_url":"https://github.com/buoyancy99/large-video-planner","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":605856,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2828173,"paper_url":"https://arxiv.org/abs/2512.15840","paper_title":"Large Video Planner Enables Generalizable Robot Control","repo_url":"https://github.com/nerfies/nerfies.github.io","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
