{"ID":2892797,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.14555","arxiv_id":"2507.14555","title":"Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions","abstract":"Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes. Our code and data are publicly available at https://github.com/jintangxue/Descrip3D.","short_abstract":"Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactio...","url_abs":"https://arxiv.org/abs/2507.14555","url_pdf":"https://arxiv.org/pdf/2507.14555v2","authors":"[\"Jintang Xue\",\"Ganning Zhao\",\"Jie-En Yao\",\"Hong-En Chen\",\"Yue Hu\",\"Meida Chen\",\"Suya You\",\"C. -C. Jay Kuo\"]","published":"2025-07-19T09:19:16Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":612018,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2892797,"paper_url":"https://arxiv.org/abs/2507.14555","paper_title":"Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions","repo_url":"https://github.com/jintangxue/Descrip3D","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}