{"ID":2880876,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.14278","arxiv_id":"2508.14278","title":"GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting","abstract":"3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA distills a scene-specific 3D instance feature field via self-supervised contrastive learning. To extend to generalized language feature fields, we introduce the core contribution of GALA, a cross-attention module with two learnable codebooks that encode view-independent semantic embeddings. This design not only ensures intra-instance feature similarity but also supports seamless 2D and 3D open-vocabulary queries. It reduces memory consumption by avoiding per-Gaussian high-dimensional feature learning. Extensive experiments on real-world datasets demonstrate GALA's remarkable open-vocabulary performance on both 2D and 3D.","short_abstract":"3D scene reconstruction and understanding have gained increasing popularity, yet existing methods still struggle to capture fine-grained, language-aware 3D representations from 2D images. In this paper, we present GALA, a novel framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). GALA...","url_abs":"https://arxiv.org/abs/2508.14278","url_pdf":"https://arxiv.org/pdf/2508.14278v2","authors":"[\"Elena Alegret\",\"Kunyi Li\",\"Sen Wang\",\"Siyun Liang\",\"Michael Niemeyer\",\"Stefano Gasperini\",\"Nassir Navab\",\"Federico Tombari\"]","published":"2025-08-19T21:26:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}