{"ID":2861821,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00458","arxiv_id":"2510.00458","title":"VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors","abstract":"Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progress in TTA for vision-language classification, TTA for VLODs remains largely unexplored. The only prior method relies on a mean-teacher framework that introduces significant latency and memory overhead. To this end, we introduce \\textsc{VLOD-TTA}, a TTA method that leverages dense proposal overlap and image-conditioned prompts to adapt VLODs with low additional overhead. \\textsc{VLOD-TTA} combines (i) an IoU-weighted entropy objective that emphasizes spatially coherent proposal clusters and mitigates confirmation bias from isolated boxes, and (ii) image-conditioned prompt selection that ranks prompts by image-level compatibility and aggregates the most informative prompt scores for detection. Our experiments across diverse distribution shifts, including artistic domains, adverse driving conditions, low-light imagery, and common corruptions, indicate that \\textsc{VLOD-TTA} consistently outperforms standard TTA baselines and the prior state-of-the-art method using YOLO-World and Grounding DINO. Code : https://github.com/imatif17/VLOD-TTA","short_abstract":"Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO exhibit strong zero-shot generalization, but their performance degrades under distribution shift. Test-time adaptation (TTA) offers a practical way to adapt models online using only unlabeled target data. However, despite substantial progres...","url_abs":"https://arxiv.org/abs/2510.00458","url_pdf":"https://arxiv.org/pdf/2510.00458v2","authors":"[\"Atif Belal\",\"Heitor R. Medeiros\",\"Marco Pedersoli\",\"Eric Granger\"]","published":"2025-10-01T03:17:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":608843,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2861821,"paper_url":"https://arxiv.org/abs/2510.00458","paper_title":"VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors","repo_url":"https://github.com/imatif17/VLOD-TTA","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
