{"ID":2834938,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.00960","arxiv_id":"2512.00960","title":"Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction","abstract":"Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. However, accurately and scalably extracting 4D interaction data from these in-the-wild videos remains a significant and unsolved challenge. To overcome the annotation bottleneck, we introduce an efficient sparse contact annotation paradigm. To scale this process, we develop InterPoint, a multi-modal predictor that drives a human-in-the-loop data engine. Building upon these efficiently acquired annotations, we introduce 4DHOISolver, a novel optimization framework that constrains the ill-posed 4D HOI reconstruction problem, maintaining high spatio-temporal coherence and physical plausibility. Leveraging this framework, we introduce Open4DHOI, a new large-scale 4D HOI dataset featuring a diverse catalog of 135 object types and 133 actions. Furthermore, we demonstrate the effectiveness of our reconstructions by enabling an RL-based agent to imitate the recovered motions. Data and code will be publicly available at https://github.com/wenboran2002/open4dhoi_code.","short_abstract":"Generalized robots must learn from diverse, large-scale human-object interactions (HOI) to operate robustly in the real world. Monocular internet videos offer a nearly limitless and readily available source of data, capturing an unparalleled diversity of human activities, objects, and environments. \nHowever, accurately...","url_abs":"https://arxiv.org/abs/2512.00960","url_pdf":"https://arxiv.org/pdf/2512.00960v3","authors":"[\"Boran Wen\",\"Ye Lu\",\"Sirui Wang\",\"Keyan Wan\",\"Jiahong Zhou\",\"Junxuan Liang\",\"Xinpeng Liu\",\"Bang Xiao\",\"Ruiyang Liu\",\"Yong-Lu Li\"]","published":"2025-11-30T16:21:47Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":606462,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2834938,"paper_url":"https://arxiv.org/abs/2512.00960","paper_title":"Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction","repo_url":"https://github.com/wenboran2002/open4dhoi_code","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}