{"ID":2884743,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.06291","arxiv_id":"2508.06291","title":"Real-Time 3D Vision-Language Embedding Mapping","abstract":"A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple, yet powerful, way to integrate the 2D embeddings of a Vision-Language Model in a metric-accurate 3D representation at real-time. We combine a local embedding masking strategy, for a more distinct embedding distribution, with a confidence-weighted 3D integration for more reliable 3D embeddings. The resulting metric-accurate embedding representation is task-agnostic and can represent semantic concepts on a global multi-room, as well as on a local object-level. This enables a variety of interactive robotic applications that require the localisation of objects-of-interest via natural language. We evaluate our approach on a variety of real-world sequences and demonstrate that these strategies achieve a more accurate object-of-interest localisation while improving the runtime performance in order to meet our real-time constraints. We further demonstrate the versatility of our approach in a variety of interactive handheld, mobile robotics and manipulation tasks, requiring only raw image data.","short_abstract":"A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple, yet powerful, way to integrate the 2D embeddings of a Vision-Language Model in a metric-accurate 3D representation at real-time. We combine a local embedding masking strategy, for a more distinct embedding dis...","url_abs":"https://arxiv.org/abs/2508.06291","url_pdf":"https://arxiv.org/pdf/2508.06291v1","authors":"[\"Christian Rauch\",\"Björn Ellensohn\",\"Linus Nwankwo\",\"Vedant Dave\",\"Elmar Rueckert\"]","published":"2025-08-08T13:11:54Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Language Model\"]","has_code":false}