{"ID":2863566,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.24713","arxiv_id":"2509.24713","title":"Circuit-Aware Reward Training: A Mechanistic Framework for Longtail Robustness in RLHF","abstract":"Reinforcement Learning from Human Feedback (RLHF) reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Drawing from recent advances showing distributed specialization for rare tokens in language models\\citep{liu2025no, liu2025emergent}, we hypothesize that reward models also develop functionally distinct circuits for longtail scenarios. Our theoretical framework establishes formal connections between circuit specialization, reward generalization bounds, and longtail performance. We introduce \\textbf{Circuit-Aware Reward Training (CART)}, which uses circuit analysis to guide data augmentation, regularization, and ensemble strategies. This approach provides both theoretical insights into reward model failures and practical interventions for improving longtail robustness.","short_abstract":"Reinforcement Learning from Human Feedback (RLHF) reward models exhibit systematic failures on longtail distributions, leading to reward hacking and misalignment. We propose a mechanistic interpretability framework that identifies specialized neural circuits responsible for rare-event processing in reward models. Drawi...","url_abs":"https://arxiv.org/abs/2509.24713","url_pdf":"https://arxiv.org/pdf/2509.24713v1","authors":"[\"Jing Liu\"]","published":"2025-09-29T12:42:14Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Reinforcement Learning\",\"Language Model\",\"RLHF\"]","has_code":false}
