{"ID":2889707,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.20964","arxiv_id":"2507.20964","title":"Core Safety Values for Provably Corrigible Agents","abstract":"We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward -- combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is *learned* to mean-squared error $\\varepsilon$ and the planner is $\\varepsilon$-sub-optimal, the probability of violating *any* safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon \"decidable island\" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.","short_abstract":"We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five *structurally separate* utility heads -- deference, switch-access preservation, truthfulness, low-impact behavior via a belief-...","url_abs":"https://arxiv.org/abs/2507.20964","url_pdf":"https://arxiv.org/pdf/2507.20964v2","authors":"[\"Aran Nayebi\"]","published":"2025-07-28T16:19:25Z","proceeding":"cs.AI","tasks":"[\"cs.AI\",\"cs.CC\",\"cs.GT\",\"cs.LG\",\"cs.MA\"]","methods":"[\"RLHF\"]","has_code":false}
