{"ID":2862045,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00806","arxiv_id":"2510.00806","title":"From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation","abstract":"Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-world physics. Second, these trajectories guide video generation through attention-based mechanisms for fine-grained motion refinement. We build a trajectory prediction dataset based on video tracking data with realistic motion patterns. Experiments on UCF-101 and MSR-VTT demonstrate that TrajVLM-Gen outperforms existing methods, achieving competitive FVD scores of 545 on UCF-101 and 539 on MSR-VTT.","short_abstract":"Current video generation models produce physically inconsistent motion that violates real-world dynamics. We propose TrajVLM-Gen, a two-stage framework for physics-aware image-to-video generation. First, we employ a Vision Language Model to predict coarse-grained motion trajectories that maintain consistency with real-...","url_abs":"https://arxiv.org/abs/2510.00806","url_pdf":"https://arxiv.org/pdf/2510.00806v1","authors":"[\"Fan Yang\",\"Zhiyang Chen\",\"Yousong Zhu\",\"Xin Li\",\"Jinqiao Wang\"]","published":"2025-10-01T12:11:36Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}