{"ID":2864966,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21797","arxiv_id":"2509.21797","title":"MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation","abstract":"Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach combines motion-aware latent world model features with pixel-space features, enabling MoWM to emphasize action-relevant visual details for action decoding. Extensive evaluations on the CALVIN and real-world manipulation tasks demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.","short_abstract":"Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and...","url_abs":"https://arxiv.org/abs/2509.21797","url_pdf":"https://arxiv.org/pdf/2509.21797v3","authors":"[\"Yangcheng Yu\",\"Xin Jin\",\"Yu Shang\",\"Xin Zhang\",\"Haisheng Su\",\"Wei Wu\",\"Yong Li\"]","published":"2025-09-26T02:54:36Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":609222,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864966,"paper_url":"https://arxiv.org/abs/2509.21797","paper_title":"MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation","repo_url":"https://github.com/tsinghua-fib-lab/MoWM","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}