{"ID":2846474,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.01187","arxiv_id":"2511.01187","title":"Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs","abstract":"Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce \\corpusname, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains -- Women's Rights, Backwardness, Terrorism, and Religion -- across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3.5 Haiku, DeepSeek-Chat, and LLaMA-3-70B), we generate over 100,000 debate responses and automatically classify which demographic groups are assigned stereotyped versus modern roles. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to Terrorism and Religion (≥89%), Africans to socioeconomic \"backwardness\" (up to 77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our \\corpusname benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.","short_abstract":"Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce \\corpusname, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset inc...","url_abs":"https://arxiv.org/abs/2511.01187","url_pdf":"https://arxiv.org/pdf/2511.01187v4","authors":"[\"Muhammed Saeed\",\"Muhammad Abdul-mageed\",\"Shady Shehata\"]","published":"2025-11-03T03:25:40Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.CY\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
