{"ID":2883002,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.09224","arxiv_id":"2508.09224","title":"From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training","abstract":"Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittleness for prompts with obscured user intent. Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology or cybersecurity), where a user request can be answered safely at a high level, but in some cases can lead to malicious uplift if sufficiently detailed or actionable. As an alternative, we propose safe-completions: a safety-training approach that centers on the safety of the assistant's output, rather than a binary classification of the user's intent. Safe-completions seek to maximize helpfulness within the safety policy's constraints. We incorporated this approach into GPT-5 and find that across both production comparisons and internally controlled experiments, safe-completion training improves safety (especially on dual-use prompts), reduces the severity of residual safety failures, and substantially increases model helpfulness.","short_abstract":"Large Language Models used in ChatGPT have traditionally been trained to learn a refusal boundary: depending on the user's intent, the model is taught to either fully comply or outright refuse. While this is a strong mitigation for explicitly malicious prompts, focusing safety training on refusals can lead to brittlene...","url_abs":"https://arxiv.org/abs/2508.09224","url_pdf":"https://arxiv.org/pdf/2508.09224v1","authors":"[\"Yuan Yuan\",\"Tina Sriskandarajah\",\"Anna-Luisa Brakman\",\"Alec Helyar\",\"Alex Beutel\",\"Andrea Vallone\",\"Saachi Jain\"]","published":"2025-08-12T00:18:23Z","proceeding":"cs.CY","tasks":"[\"cs.CY\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false}
