{"ID":2834962,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.01008","arxiv_id":"2512.01008","title":"LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency","abstract":"Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware Low-Rank Adaptation (LoRA) layers and reusing a frozen SAM-3D reconstructor. During training we exploit off-the-shelf RGB-D sequences and their camera poses to build a differentiable reprojection loss that enforces cross-view agreement without requiring any additional 3D-text supervision. The resulting masks are concatenated with RGB images to form RGBA prompts for SAM-3D, which outputs Gaussian splats or textured meshes without retraining. Across ScanRefer and Nr3D, LISA-3D improves language-to-3D accuracy by up to +15.6 points over single-view baselines while adapting only 11.6M parameters. The system is modular, data-efficient, and supports zero-shot deployment on unseen categories, providing a practical recipe for language-guided 3D content creation. Our code will be available at https://github.com/binisalegend/LISA-3D.","short_abstract":"Text-driven 3D reconstruction demands a mask generator that simultaneously understands open-vocabulary instructions and remains consistent across viewpoints. We present LISA-3D, a two-stage framework that lifts language-image segmentation into 3D by retrofitting the instruction-following model LISA with geometry-aware...","url_abs":"https://arxiv.org/abs/2512.01008","url_pdf":"https://arxiv.org/pdf/2512.01008v1","authors":"[\"Zhongbin Guo\",\"Jiahe Liu\",\"Wenyu Gao\",\"Yushan Li\",\"Chengzhi Li\",\"Ping Jian\"]","published":"2025-11-30T18:02:14Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"LoRA\"]","has_code":false,"code_links":[{"ID":606465,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2834962,"paper_url":"https://arxiv.org/abs/2512.01008","paper_title":"LISA-3D: Lifting Language-Image Segmentation to 3D via Multi-View Consistency","repo_url":"https://github.com/binisalegend/LISA-3D","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}