{"ID":2836139,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.22787","arxiv_id":"2511.22787","title":"World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models","abstract":"In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find that supervised fine-tuning using a diverse culture mixing dataset substantially improves model consistency and reduces background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.","short_abstract":"In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. 
We investigate culture mixing as a critical challenge for LVLMs and exam...","url_abs":"https://arxiv.org/abs/2511.22787","url_pdf":"https://arxiv.org/pdf/2511.22787v2","authors":"[\"Eunsu Kim\",\"Junyeong Park\",\"Na Min An\",\"Junseong Kim\",\"Hitesh Laxmichand Patel\",\"Jiho Jin\",\"Julia Kruk\",\"Amit Agarwal\",\"Srikant Panda\",\"Fenal Ashokbhai Ilasariya\",\"Hyunjung Shim\",\"Alice Oh\"]","published":"2025-11-27T22:23:08Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false}