{"ID":2899417,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.00334","arxiv_id":"2507.00334","title":"Populate-A-Scene: Affordance-Aware Human Video Generation","abstract":"Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the scene, while ensuring coherent behavior, appearance, harmonization, and scene affordance. Unlike prior work, we infer human affordance for video generation (i.e., where to insert a person and how they should behave) from a single scene image, without explicit conditions like bounding boxes or body poses. An in-depth study of cross-attention heatmaps demonstrates that we can uncover the inherent affordance perception of a pre-trained video model without labeled affordance datasets.","short_abstract":"Can a video generation model be repurposed as an interactive world simulator? We explore the affordance perception potential of text-to-video models by teaching them to predict human-environment interaction. Given a scene image and a prompt describing human actions, we fine-tune the model to insert a person into the sc...","url_abs":"https://arxiv.org/abs/2507.00334","url_pdf":"https://arxiv.org/pdf/2507.00334v1","authors":"[\"Mengyi Shan\",\"Zecheng He\",\"Haoyu Ma\",\"Felix Juefei-Xu\",\"Peizhao Zhang\",\"Tingbo Hou\",\"Ching-Yao Chuang\"]","published":"2025-07-01T00:21:24Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}