{"ID":2859907,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.04860","arxiv_id":"2510.04860","title":"Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails","abstract":"As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM agents. Unlike training-time failures, ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies. We formalize and analyze ATP through two complementary paradigms: Self-Interested Exploration, where repeated high-reward deviations induce individual behavioral drift, and Imitative Strategy Diffusion, where deviant behaviors spread across multi-agent systems. Building on these paradigms, we construct controllable testbeds and benchmark both open and closed-source LLMs. Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states. In multi-agent settings, successful violations diffuse quickly, leading to collective misalignment. Moreover, current reinforcement learning-based alignment methods provide limited defenses against alignment tipping. These findings demonstrate that alignment of LLM agents is not a static property but a fragile and dynamic one, vulnerable to feedback-driven decay during deployment. Our data and code are available at https://github.com/aiming-lab/ATP.","short_abstract":"As Large Language Model (LLM) agents increasingly gain self-evolutionary capabilities to adapt and refine their strategies through real-world interaction, their long-term reliability becomes a critical concern. We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving LLM...","url_abs":"https://arxiv.org/abs/2510.04860","url_pdf":"https://arxiv.org/pdf/2510.04860v2","authors":"[\"Siwei Han\",\"Kaiwen Xiong\",\"Jiaqi Liu\",\"Xinyu Ye\",\"Yaofeng Su\",\"Wenbo Duan\",\"Xinyuan Liu\",\"Cihang Xie\",\"Mohit Bansal\",\"Mingyu Ding\",\"Linjun Zhang\",\"Huaxiu Yao\"]","published":"2025-10-06T14:48:39Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Diffusion Model\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":608677,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2859907,"paper_url":"https://arxiv.org/abs/2510.04860","paper_title":"Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails","repo_url":"https://github.com/aiming-lab/ATP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
