{"ID":2856541,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.11834","arxiv_id":"2510.11834","title":"Don't Walk the Line: Boundary Guidance for Filtered Generation","abstract":"Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak, ambiguous, and longcontext prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.","short_abstract":"Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundar...","url_abs":"https://arxiv.org/abs/2510.11834","url_pdf":"https://arxiv.org/pdf/2510.11834v2","authors":"[\"Sarah Ball\",\"Andreas Haupt\"]","published":"2025-10-13T18:33:17Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Reinforcement Learning\",\"Large Language Model\"]","has_code":false}
