{"ID":2871958,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.10707","arxiv_id":"2509.10707","title":"Understanding AI Evaluation Patterns: How Different GPT Models Assess Vision-Language Descriptions","abstract":"As AI systems increasingly evaluate other AI outputs, understanding their assessment behavior becomes crucial for preventing cascading biases. This study analyzes vision-language descriptions generated by NVIDIA's Describe Anything Model and evaluated by three GPT variants (GPT-4o, GPT-4o-mini, GPT-5) to uncover distinct \"evaluation personalities\" the underlying assessment strategies and biases each model demonstrates. GPT-4o-mini exhibits systematic consistency with minimal variance, GPT-4o excels at error detection, while GPT-5 shows extreme conservatism with high variability. Controlled experiments using Gemini 2.5 Pro as an independent question generator validate that these personalities are inherent model properties rather than artifacts. Cross-family analysis through semantic similarity of generated questions reveals significant divergence: GPT models cluster together with high similarity while Gemini exhibits markedly different evaluation strategies. All GPT models demonstrate a consistent 2:1 bias favoring negative assessment over positive confirmation, though this pattern appears family-specific rather than universal across AI architectures. These findings suggest that evaluation competence does not scale with general capability and that robust AI assessment requires diverse architectural perspectives.","short_abstract":"As AI systems increasingly evaluate other AI outputs, understanding their assessment behavior becomes crucial for preventing cascading biases. This study analyzes vision-language descriptions generated by NVIDIA's Describe Anything Model and evaluated by three GPT variants (GPT-4o, GPT-4o-mini, GPT-5) to uncover distin...","url_abs":"https://arxiv.org/abs/2509.10707","url_pdf":"https://arxiv.org/pdf/2509.10707v2","authors":"[\"Sajjad Abdoli\",\"Rudi Cilibrasi\",\"Rima Al-Shikh\"]","published":"2025-09-12T21:48:59Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CL\"]","methods":"[]","has_code":false}
