{"ID":2867382,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.21388","arxiv_id":"2509.21388","title":"TUN3D: Towards Real-World Scene Understanding from Unposed Images","abstract":"Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d .","short_abstract":"Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cam...","url_abs":"https://arxiv.org/abs/2509.21388","url_pdf":"https://arxiv.org/pdf/2509.21388v1","authors":"[\"Anton Konushin\",\"Nikita Drozdov\",\"Bulat Gabdullin\",\"Alexey Zakharov\",\"Anna Vorontsova\",\"Danila Rukhovich\",\"Maksim Kolodiazhnyi\"]","published":"2025-09-23T20:24:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"eess.IV\"]","methods":"[]","has_code":false,"code_links":[{"ID":609458,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2867382,"paper_url":"https://arxiv.org/abs/2509.21388","paper_title":"TUN3D: Towards Real-World Scene Understanding from Unposed Images","repo_url":"https://github.com/col14m/tun3d","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}