{"ID":3053368,"CreatedAt":"2026-06-04T04:41:36.695875263Z","UpdatedAt":"2026-06-06T03:31:06.711308811Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.04413","arxiv_id":"2606.04413","title":"(Mis)generalization of Helpful-only Fine-tuning","abstract":"Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R\u0026D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.","short_abstract":"Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R\u0026D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their ha...","url_abs":"https://arxiv.org/abs/2606.04413","url_pdf":"https://arxiv.org/pdf/2606.04413v1","authors":"[\"Mohammad Omar Khursheed\",\"Baram Sosis\",\"Fabien Roger\"]","published":"2026-06-03T03:43:08Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[]","has_code":false}
