{"ID":2881122,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.12916","arxiv_id":"2508.12916","title":"RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph","abstract":"Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete scene visibility, which limits adaptability and incurs high hardware costs. We present \\textbf{RoboRetriever}, a novel framework for real-world object retrieval that operates using only a \\textbf{single} wrist-mounted RGB-D camera and free-form natural language instructions. RoboRetriever grounds visual observations to build and update a \\textbf{dynamic hierarchical scene graph} that encodes object semantics, geometry, and inter-object relations over time. The supervisor module reasons over this memory and task instruction to infer the target object and coordinate an integrated action module combining \\textbf{active perception}, \\textbf{interactive perception}, and \\textbf{manipulation}. To enable task-aware scene-grounded active perception, we introduce a novel visual prompting scheme that leverages large reasoning vision-language models to determine 6-DoF camera poses aligned with the semantic task goal and geometry scene context.\nWe evaluate RoboRetriever on diverse real-world object retrieval tasks, including scenarios with human intervention, demonstrating strong adaptability and robustness in cluttered scenes with only one RGB-D camera.","short_abstract":"Humans effortlessly retrieve objects in cluttered, partially observable environments by combining visual reasoning, active viewpoint adjustment, and physical interaction, with only a single pair of eyes. In contrast, most existing robotic systems rely on carefully positioned fixed or multi-camera setups with complete sc...","url_abs":"https://arxiv.org/abs/2508.12916","url_pdf":"https://arxiv.org/pdf/2508.12916v1","authors":"[\"Hecheng Wang\",\"Jiankun Ren\",\"Jia Yu\",\"Lizhe Qi\",\"Yunquan Sun\"]","published":"2025-08-18T13:31:23Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Language Model\"]","has_code":false}