{"ID":2840413,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13047","arxiv_id":"2511.13047","title":"DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation","abstract":"Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-attention mechanisms and insufficiently model intra- and inter-modal feature relationships, resulting in imprecise feature alignment and limited discriminative representation. To address these challenges, we propose DiffPixelFormer, a differential pixel-aware Transformer for RGB-D indoor scene segmentation that simultaneously enhances intra-modal representations and models inter-modal interactions. At its core, the Intra-Inter Modal Interaction Block (IIMIB) captures intra-modal long-range dependencies via self-attention and models inter-modal interactions with the Differential-Shared Inter-Modal (DSIM) module to disentangle modality-specific and shared cues, enabling fine-grained, pixel-level cross-modal alignment. Furthermore, a dynamic fusion strategy balances modality contributions and fully exploits RGB-D information according to scene characteristics. Extensive experiments on the SUN RGB-D and NYUDv2 benchmarks demonstrate that DiffPixelFormer-L achieves mIoU scores of 54.28% and 59.95%, outperforming DFormer-L by 1.78% and 2.75%, respectively. Code is available at https://github.com/gongyan1/DiffPixelFormer.","short_abstract":"Indoor semantic segmentation is fundamental to computer vision and robotics, supporting applications such as autonomous navigation, augmented reality, and smart environments. Although RGB-D fusion leverages complementary appearance and geometric cues, existing methods often depend on computationally intensive cross-att...","url_abs":"https://arxiv.org/abs/2511.13047","url_pdf":"https://arxiv.org/pdf/2511.13047v1","authors":"[\"Yan Gong\",\"Jianli Lu\",\"Yongsheng Gao\",\"Jie Zhao\",\"Xiaojuan Zhang\",\"Susanto Rahardja\"]","published":"2025-11-17T06:51:07Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.RO\"]","methods":"[\"Transformer\"]","has_code":false,"code_links":[{"ID":606966,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2840413,"paper_url":"https://arxiv.org/abs/2511.13047","paper_title":"DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation","repo_url":"https://github.com/gongyan1/DiffPixelFormer","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}