{"ID":2837315,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18746","arxiv_id":"2511.18746","title":"Any4D: Open-Prompt 4D Generation from Natural Language and Images","abstract":"While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation--hindering generative models from achieving a \\textit{\"GPT moment\"} in the embodied domain. There is a naive observation: \\textit{the diversity of embodied data far exceeds the relatively small space of possible primitive motions}. Based on this insight, we propose \\textbf{Primitive Embodied World Models} (PEWM), which restricts video generation to fixed shorter horizons, our approach \\textit{1) enables} fine-grained alignment between linguistic concepts and visual representations of robotic actions, \\textit{2) reduces} learning complexity, \\textit{3) improves} data efficiency in embodied data collection, and \\textit{4) decreases} inference latency. By equipping with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.","short_abstract":"While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actio...","url_abs":"https://arxiv.org/abs/2511.18746","url_pdf":"https://arxiv.org/pdf/2511.18746v2","authors":"[\"Hao Li\",\"Qiao Sun\"]","published":"2025-11-24T04:17:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false}
