{"ID":2842722,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.09141","arxiv_id":"2511.09141","title":"RGMP: Recurrent Geometric-prior Multimodal Policy for Generalizable Humanoid Robot Manipulation","abstract":"Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise concerns due to the neglect of geometric reasoning in unseen scenarios and the inefficient modeling of robot-target relationships within the training data, resulting in significant waste of training resources. To address these limitations, we present the Recurrent Geometric-prior Multimodal Policy (RGMP), an end-to-end framework that unifies geometric-semantic skill reasoning with data-efficient visuomotor control. For perception capabilities, we propose the Geometric-prior Skill Selector, which infuses geometric inductive biases into a vision language model, producing adaptive skill sequences for unseen scenes with minimal spatial common sense tuning. To achieve data-efficient robotic motion synthesis, we introduce the Adaptive Recursive Gaussian Network, which parameterizes robot-object interactions as a compact hierarchy of Gaussian processes that recursively encode multi-scale spatial relationships, yielding dexterous, data-efficient motion synthesis even from sparse demonstrations. Evaluated on both our humanoid robot and desktop dual-arm robot, the RGMP framework achieves 87% task success in generalization tests and exhibits 5x greater data efficiency than the state-of-the-art model. This performance underscores its superior cross-domain generalization, enabled by geometric-semantic reasoning and recursive-Gaussion adaptation.","short_abstract":"Humanoid robots exhibit significant potential in executing diverse human-level skills. However, current research predominantly relies on data-driven approaches that necessitate extensive training datasets to achieve robust multimodal decision-making capabilities and generalizable visuomotor control. These methods raise...","url_abs":"https://arxiv.org/abs/2511.09141","url_pdf":"https://arxiv.org/pdf/2511.09141v2","authors":"[\"Xuetao Li\",\"Wenke Huang\",\"Nengyuan Pan\",\"Kaiyan Zhao\",\"Songhua Yang\",\"Yiming Wang\",\"Mengde Li\",\"Mang Ye\",\"Jifeng Xuan\",\"Miao Li\"]","published":"2025-11-12T09:29:08Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Language Model\"]","has_code":false}