{"ID":2842630,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.10691","arxiv_id":"2511.10691","title":"Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models","abstract":"The potential data contamination issue in contemporary large language models (LLMs) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, they predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce \\textsc{Squid Game}, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, including instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition in performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. We also compare prominent LLM benchmarks and \\textsc{Squid Game}, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. Project page: https://github.com/zijianchen98/LLM_Squid_Game.","short_abstract":"The potential data contamination issue in contemporary large language models (LLMs) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, they predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, w...","url_abs":"https://arxiv.org/abs/2511.10691","url_pdf":"https://arxiv.org/pdf/2511.10691v2","authors":"[\"Zijian Chen\",\"Wenjun Zhang\",\"Guangtao Zhai\"]","published":"2025-11-12T06:06:29Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607142,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2842630,"paper_url":"https://arxiv.org/abs/2511.10691","paper_title":"Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models","repo_url":"https://github.com/zijianchen98/LLM_Squid_Game","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}