{"ID":2872987,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.07596","arxiv_id":"2509.07596","title":"Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation","abstract":"Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.","short_abstract":"Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backg...","url_abs":"https://arxiv.org/abs/2509.07596","url_pdf":"https://arxiv.org/pdf/2509.07596v2","authors":"[\"Yusuke Hirota\",\"Ryo Hachiuma\",\"Boyi Li\",\"Ximing Lu\",\"Michael Ross Boone\",\"Boris Ivanovic\",\"Yejin Choi\",\"Marco Pavone\",\"Yu-Chiang Frank Wang\",\"Noa Garcia\",\"Yuta Nakashima\",\"Chao-Han Huck Yang\"]","published":"2025-09-09T11:14:11Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}