{"ID":2830721,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09646","arxiv_id":"2512.09646","title":"VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification","abstract":"Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depths or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.","short_abstract":"Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability in video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: spar...","url_abs":"https://arxiv.org/abs/2512.09646","url_pdf":"https://arxiv.org/pdf/2512.09646v2","authors":"[\"Wanyue Zhang\",\"Lin Geng Foo\",\"Thabo Beeler\",\"Rishabh Dabral\",\"Christian Theobalt\"]","published":"2025-12-10T13:40:24Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","project_urls":"[\"https://vcai.mpi-inf.mpg.de/projects/vhoi/\"]","has_code":false,"code_links":[{"ID":606067,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2830721,"paper_url":"https://arxiv.org/abs/2512.09646","paper_title":"VHOI: Controllable Video Generation of Human-Object Interactions from Sparse Trajectories via Motion Densification","repo_url":"https://github.com/rosettawyzhang/VHOI","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}