{"ID":2891056,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.18742","arxiv_id":"2507.18742","title":"Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement","abstract":"Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction .","short_abstract":"Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. 
We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to ident...","url_abs":"https://arxiv.org/abs/2507.18742","url_pdf":"https://arxiv.org/pdf/2507.18742v1","authors":"[\"Víctor Gallego\"]","published":"2025-07-24T18:44:28Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":611843,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2891056,"paper_url":"https://arxiv.org/abs/2507.18742","paper_title":"Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement","repo_url":"https://github.com/vicgalle/specification-self-correction","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
