{"ID":2847019,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00925","arxiv_id":"2511.00925","title":"Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval","abstract":"The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has achieved increasing attention due to its wide applications, e.g. e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list by the local and global aggregation blocks to measure the aligning quality of sketch and image samples, (iii) a Weighted Quadruplet Loss Module aiming to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show our method delivers superior performances over the state-of-the-art ZS-SBIR methods.","short_abstract":"The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has achieved increasing attention due to its wide applications, e.g. e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in s...","url_abs":"https://arxiv.org/abs/2511.00925","url_pdf":"https://arxiv.org/pdf/2511.00925v1","authors":"[\"Hanwen Su\",\"Ge Song\",\"Jiyan Wang\",\"Yuanbo Zhu\"]","published":"2025-11-02T13:07:48Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}