{"ID":2858820,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.07092","arxiv_id":"2510.07092","title":"Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report","abstract":"World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.","short_abstract":"World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecas...","url_abs":"https://arxiv.org/abs/2510.07092","url_pdf":"https://arxiv.org/pdf/2510.07092v1","authors":"[\"Riccardo Mereu\",\"Aidan Scannell\",\"Yuxin Hou\",\"Yi Zhao\",\"Aditya Jitta\",\"Antonio Dominguez\",\"Luigi Acerbi\",\"Amos Storkey\",\"Paul Chang\"]","published":"2025-10-08T14:49:12Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.RO\"]","methods":"[\"Transformer\",\"LoRA\"]","has_code":false}