{"ID":2876414,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.00361","arxiv_id":"2509.00361","title":"Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation","abstract":"Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce {GVF-TAPE}, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipulation. GVF-TAPE employs a generative video model to predict future RGB-D frames from a single side-view RGB image and a task description, offering visual plans that guide robot actions. A decoupled pose estimation model then extracts end-effector poses from the predicted frames, translating them into executable commands via low-level controllers. By iteratively integrating video foresight and pose estimation in a closed loop, GVF-TAPE achieves real-time, adaptive manipulation across a broad range of tasks. Extensive experiments in both simulation and real-world settings demonstrate that our approach reduces reliance on task-specific action data and generalizes effectively, providing a practical and scalable solution for intelligent robotic systems.","short_abstract":"Robotic manipulation in unstructured environments requires systems that can generalize across diverse tasks while maintaining robust and reliable performance. We introduce {GVF-TAPE}, a closed-loop framework that combines generative visual foresight with task-agnostic pose estimation to enable scalable robotic manipula...","url_abs":"https://arxiv.org/abs/2509.00361","url_pdf":"https://arxiv.org/pdf/2509.00361v1","authors":"[\"Chuye Zhang\",\"Xiaoxiong Zhang\",\"Wei Pan\",\"Linfang Zheng\",\"Wei Zhang\"]","published":"2025-08-30T04:53:32Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false}