{"ID":2841292,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.12155","arxiv_id":"2511.12155","title":"Rethinking Deep Alignment Through The Lens Of Incomplete Learning","abstract":"Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning where safety training fails to transform model preferences in later response regions fully. We introduce base-favored tokens -- vocabulary elements where base models assign higher probability than aligned models -- as computational indicators of incomplete safety learning and develop a targeted completion method that addresses undertrained regions through adaptive penalties and hybrid teacher distillation. Experimental evaluation across Llama and Qwen model families demonstrates dramatic improvements in adversarial robustness, with 48--98% reductions in attack success rates while preserving general capabilities. These results establish both a mechanistic understanding and practical solutions for fundamental limitations in safety alignment methodologies.","short_abstract":"Large language models exhibit systematic vulnerabilities to adversarial attacks despite extensive safety alignment. We provide a mechanistic analysis revealing that position-dependent gradient weakening during autoregressive training creates signal decay, leading to incomplete safety learning where safety training fail...","url_abs":"https://arxiv.org/abs/2511.12155","url_pdf":"https://arxiv.org/pdf/2511.12155v1","authors":"[\"Thong Bach\",\"Dung Nguyen\",\"Thao Minh Le\",\"Truyen Tran\"]","published":"2025-11-15T10:53:03Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
