{"ID":2879094,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.16943","arxiv_id":"2508.16943","title":"LHM-Humanoid: Learning a Unified Policy for Long-Horizon Humanoid Whole-Body Loco-Manipulation in Diverse Messy Environments","abstract":"We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes. In our setting, multiple objects are displaced from their intended locations and may obstruct navigation; a humanoid agent must repeatedly (i) walk to a target, (ii) pick it up with diverse whole-body postures under balance constraints, (iii) carry it while navigating around obstacles, and (iv) place it at a designated goal -- all within a single continuous episode and without any environment reset. This task simultaneously demands cross-scene generalization and unified one-policy control: layouts, obstacle arrangements, object category/mass/shape/color and object start/goal poses vary substantially even within a room category, requiring a single general policy that directly outputs actions rather than invoking pre-trained skill libraries. Our dataset spans four room types (bedroom, living room, kitchen, and warehouse), comprising 350 diverse scenes/tasks with 79 objects (25 movable targets). Since no scene-specific ground-truth motion sequences are provided, we learn goal-conditioned teacher policies via reinforcement learning and distill them into a single end-to-end student policy using DAgger. We further distill this unified policy into a vision-language-action (VLA) model driven by egocentric RGB observations and natural language. Experiments in Isaac Gym demonstrate that LHM-Humanoid substantially outperforms end-to-end RL baselines and prior humanoid loco-manipulation methods on both seen and unseen scenes, exhibiting strong long-horizon robustness and cross-scene generalization.","short_abstract":"We introduce LHM-Humanoid, a benchmark and learning framework for long-horizon whole-body humanoid loco-manipulation in diverse, cluttered scenes. In our setting, multiple objects are displaced from their intended locations and may obstruct navigation; a humanoid agent must repeatedly (i) walk to a target, (ii) pick it...","url_abs":"https://arxiv.org/abs/2508.16943","url_pdf":"https://arxiv.org/pdf/2508.16943v2","authors":"[\"Haozhuo Zhang\",\"Jingkai Sun\",\"Michele Caprio\",\"Jian Tang\",\"Shanghang Zhang\",\"Qiang Zhang\",\"Wei Pan\"]","published":"2025-08-23T08:23:14Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\"]","has_code":false}