{"ID":2869505,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15212","arxiv_id":"2509.15212","title":"RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation","abstract":"This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.","short_abstract":"This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulat...","url_abs":"https://arxiv.org/abs/2509.15212","url_pdf":"https://arxiv.org/pdf/2509.15212v1","authors":"[\"Yuming Jiang\",\"Siteng Huang\",\"Shengke Xue\",\"Yaxi Zhao\",\"Jun Cen\",\"Sicong Leng\",\"Kehan Li\",\"Jiayan Guo\",\"Kexiang Wang\",\"Mingxiu Chen\",\"Fan Wang\",\"Deli Zhao\",\"Xin Li\"]","published":"2025-09-18T17:58:02Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.RO\"]","methods":"[\"Variational Autoencoder\"]","has_code":false}
