{"ID":2857200,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.15963","arxiv_id":"2510.15963","title":"ESCA: Contextualizing Embodied Agents via Scene-Graph Generation","abstract":"Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.","short_abstract":"Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challen...","url_abs":"https://arxiv.org/abs/2510.15963","url_pdf":"https://arxiv.org/pdf/2510.15963v2","authors":"[\"Jiani Huang\",\"Amish Sethi\",\"Matthew Kuo\",\"Mayank Keoliya\",\"Neelay Velingker\",\"JungHo Jung\",\"Ser-Nam Lim\",\"Ziyang Li\",\"Mayur Naik\"]","published":"2025-10-11T20:13:59Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608424,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2857200,"paper_url":"https://arxiv.org/abs/2510.15963","paper_title":"ESCA: Contextualizing Embodied Agents via Scene-Graph Generation","repo_url":"https://github.com/video-fm/LASER","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0},{"ID":608425,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2857200,"paper_url":"https://arxiv.org/abs/2510.15963","paper_title":"ESCA: Contextualizing Embodied Agents via Scene-Graph Generation","repo_url":"https://github.com/video-fm/ESCA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}