{"ID":2831809,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.07810","arxiv_id":"2512.07810","title":"Auditing Games for Sandbagging","abstract":"Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbagging. Second, a blue team used black-box, model-internals, or training-based approaches to identify sandbagging models. We found that the blue team could not reliably discriminate sandbaggers from benign models. Black-box approaches were defeated by effective imitation of a weaker model. Linear probes, a model-internals approach, showed more promise but their naive application was vulnerable to behaviours instilled by the red team. We also explored capability elicitation as a strategy for detecting sandbagging. Although prompt-based elicitation was not reliable, training-based elicitation consistently elicited full performance from the sandbagging models, using only a single correct demonstration of the evaluation task. However, the performance of benign models was sometimes also raised, so relying on elicitation as a detection strategy was prone to false positives. In the short term, we recommend developers remove potential sandbagging using on-distribution training for elicitation. In the longer term, further research is needed to ensure the efficacy of training-based elicitation, and to develop robust methods for sandbagging detection. We open source our model organisms at https://github.com/AI-Safety-Institute/sandbagging_auditing_games and select transcripts and results at https://huggingface.co/datasets/sandbagging-games/evaluation_logs . A demo illustrating the game can be played at https://sandbagging-demo.far.ai/ .","short_abstract":"Future AI systems could conceal their capabilities ('sandbagging') during evaluations, potentially misleading developers and auditors. We stress-tested sandbagging detection techniques using an auditing game. First, a red team fine-tuned five models, some of which conditionally underperformed, as a proxy for sandbaggin...","url_abs":"https://arxiv.org/abs/2512.07810","url_pdf":"https://arxiv.org/pdf/2512.07810v1","authors":"[\"Jordan Taylor\",\"Sid Black\",\"Dillon Bowen\",\"Thomas Read\",\"Satvik Golechha\",\"Alex Zelenka-Martin\",\"Oliver Makins\",\"Connor Kissane\",\"Kola Ayonrinde\",\"Jacob Merizian\",\"Samuel Marks\",\"Chris Cundy\",\"Joseph Bloom\"]","published":"2025-12-08T18:44:44Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Generative Adversarial Network\"]","project_urls":"[\"https://sandbagging-demo.far.ai/\"]","has_code":false,"code_links":[{"ID":606170,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2831809,"paper_url":"https://arxiv.org/abs/2512.07810","paper_title":"Auditing Games for Sandbagging","repo_url":"https://github.com/AI-Safety-Institute/sandbagging_auditing_games","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
