Hierarchical Context Learning of object components for unsupervised semantic segmentation

Pattern Recognition 2025 4

Abstract

Unsupervised Semantic Segmentation (USS) aims to learn semantically rich and dense representations without
relying on labels. Recent advances in self-supervised learning have demonstrated the potential of pretrained
vision transformers to capture patch-level semantic information, offering a promising direction for USS.
However, existing methods face challenges in constructing a discriminative spatial token embedding space that
consistently and effectively represents the well-structured semantic relationships among object components.
Inspired by Edwin Hancock’s pioneering work on hierarchical pattern analysis, we highlight the critical role
of hierarchical context to overcome this limitation. By modeling spatial relationships at multiple levels of
granularity, hierarchical context helps align related object parts while distinguishing them across semantic
groups. Based on this insight, we introduce Hierarchical Context Learning (HCL), a novel approach for USS that
enhances semantic consistency by integrating hierarchical context. HCL incorporates a novel parallel multi-level
vision transformer backbone to aggregate multi-level contextual information into object component tokens.
To uncover the semantic structure of objects, we propose Momentum-based Global Foreground–Background
Clustering (MoGoClustering) to cluster object components into coherent semantic groups and then calculate
their semantic centroids. To enforce intra-group semantic consistency and maximize inter-group separation
across spatial scales, we design a foreground–background-aware contrastive loss based on MoGoClustering.
Our method achieves state-of-the-art performance on the COCO-Stuff and Pascal VOC datasets, demonstrating
its ability to learn robust, context-aware, and discriminative object component semantics for USS. The code is
available at: https://github.com/dbaofd/HCL.
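
As a rough illustration of the clustering and contrastive steps described above, the following is a minimal sketch only, not the authors' implementation. It assumes object component tokens are plain feature vectors, that centroids are updated with an exponential moving average, and that the loss is a generic InfoNCE-style cross-entropy over token-to-centroid similarities. The foreground-background split and the multi-level backbone are omitted, and all function names, the momentum value, and the temperature are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def assign_clusters(tokens, centroids):
    """Assign each token to its most similar centroid (cosine similarity)."""
    sims = l2_normalize(tokens) @ l2_normalize(centroids).T
    return sims.argmax(axis=1)

def momentum_update(centroids, tokens, labels, momentum=0.9):
    """Move each centroid toward the mean of its assigned tokens (EMA).

    Centroids with no assigned tokens are left unchanged.
    """
    updated = centroids.copy()
    for k in range(centroids.shape[0]):
        members = tokens[labels == k]
        if len(members) > 0:
            updated[k] = momentum * centroids[k] + (1 - momentum) * members.mean(axis=0)
    return updated

def centroid_contrastive_loss(tokens, centroids, labels, temperature=0.1):
    """Cross-entropy over token-to-centroid similarities (InfoNCE-style).

    Pulls each token toward its own centroid and away from the others.
    """
    logits = l2_normalize(tokens) @ l2_normalize(centroids).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy data standing in for object component tokens (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))     # 64 tokens, 16-dim features
centroids = rng.normal(size=(4, 16))   # 4 semantic groups

labels = assign_clusters(tokens, centroids)
new_centroids = momentum_update(centroids, tokens, labels)
loss = centroid_contrastive_loss(tokens, new_centroids, labels)
```

In this sketch the momentum term keeps centroids stable across batches, which is one plausible reading of "momentum-based" clustering; the released code at the URL above is the authoritative reference for the actual method.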
