{"ID":2824653,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.22969","arxiv_id":"2512.22969","title":"CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision","abstract":"Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.","short_abstract":"Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight paral...","url_abs":"https://arxiv.org/abs/2512.22969","url_pdf":"https://arxiv.org/pdf/2512.22969v1","authors":"[\"Behnam Raoufi\",\"Hossein Sharify\",\"Mohamad Mahdee Ramezanee\",\"Khosrow Hajsadeghi\",\"Saeed Bagheri Shouraki\"]","published":"2025-12-28T15:21:20Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Convolutional Neural Network\"]","has_code":false}
