{"ID":2847658,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27558","arxiv_id":"2510.27558","title":"Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs","abstract":"This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.","short_abstract":"This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene gr...","url_abs":"https://arxiv.org/abs/2510.27558","url_pdf":"https://arxiv.org/pdf/2510.27558v1","authors":"[\"Sushil Samuel Dinesh\",\"Shinkyu Park\"]","published":"2025-10-31T15:42:32Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.AI\",\"cs.LG\"]","methods":"[]","has_code":false}