{"ID":2843131,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.07820","arxiv_id":"2511.07820","title":"SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control","abstract":"Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. \nScaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.","short_abstract":"Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model ca...","url_abs":"https://arxiv.org/abs/2511.07820","url_pdf":"https://arxiv.org/pdf/2511.07820v3","authors":"[\"Zhengyi Luo\",\"Ye Yuan\",\"Tingwu Wang\",\"Chenran Li\",\"Fernando Castañeda\",\"Sirui Chen\",\"Zi-Ang Cao\",\"Jiefeng Li\",\"David Minor\",\"Qingwei Ben\",\"Jinhyung Park\",\"David Sami\",\"Zi Wang\",\"Xingye Da\",\"Runyu Ding\",\"Cyrus Hogg\",\"Lina Song\",\"Edy Lim\",\"Eugene Jeong\",\"Tairan He\",\"Haoru Xue\",\"Wenli Xiao\",\"Simon Yuen\",\"Jan Kautz\",\"Yan Chang\",\"Umar Iqbal\",\"Linxi \\\"Jim\\\" Fan\",\"Yuke Zhu\"]","published":"2025-11-11T04:37:40Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.CV\",\"cs.GR\",\"eess.SY\"]","methods":"[]","has_code":false}
