{"ID":2921742,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-07T03:54:17.966829144Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01247","arxiv_id":"2606.01247","title":"Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?","abstract":"Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image -- and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments.\nOur code, data, and models are available at https://github.com/aim-uofa/TVRBench.","short_abstract":"Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) -- an active task where an agent adjusts its...","url_abs":"https://arxiv.org/abs/2606.01247","url_pdf":"https://arxiv.org/pdf/2606.01247v1","authors":"[\"Liyang Li\",\"Muzhi Zhu\",\"Zhiyue Zhao\",\"Hengyu Zhao\",\"Ke Liu\",\"Linhao Zhong\",\"Hao Chen\",\"Chunhua Shen\"]","published":"2026-05-31T14:00:10Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"LoRA\"]","has_code":false,"code_links":[{"ID":612598,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-02T02:42:49.606572591Z","DeletedAt":null,"paper_id":2921742,"paper_url":"https://arxiv.org/abs/2606.01247","paper_title":"Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?","repo_url":"https://github.com/aim-uofa/TVRBench","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
