{"ID":2826158,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.19159","arxiv_id":"2512.19159","title":"OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions","abstract":"Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: https://OmniMoGen.github.io/.","short_abstract":"Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a un...","url_abs":"https://arxiv.org/abs/2512.19159","url_pdf":"https://arxiv.org/pdf/2512.19159v1","authors":"[\"Wendong Bu\",\"Kaihang Pan\",\"Yuze Lin\",\"Jiacheng Li\",\"Kai Shen\",\"Wenqiao Zhang\",\"Juncheng Li\",\"Jun Xiao\",\"Siliang Tang\"]","published":"2025-12-22T08:55:23Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\",\"Variational Autoencoder\"]","has_code":false}
