{"ID":2864731,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.23316","arxiv_id":"2509.23316","title":"C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection","abstract":"Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose \\textbf{C3-OWD}, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage~1 enhances robustness by pretraining with RGBT data, while Stage~2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.","short_abstract":"Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robu...","url_abs":"https://arxiv.org/abs/2509.23316","url_pdf":"https://arxiv.org/pdf/2509.23316v2","authors":"[\"Siheng Wang\",\"Zhengdao Li\",\"Yanshu Li\",\"Canran Xiao\",\"Haibo Zhan\",\"Zhengtao Yao\",\"Xuzhi Zhang\",\"Jiale Kang\",\"Linshan Li\",\"Weiming Liu\",\"Zhikang Dong\",\"Jifeng Shen\",\"Junhao Dong\",\"Qiang Sun\",\"Piotr Koniusz\"]","published":"2025-09-27T14:04:15Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":609193,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2864731,"paper_url":"https://arxiv.org/abs/2509.23316","paper_title":"C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection","repo_url":"https://github.com/justin-herry/C3-OWD.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}