{"ID":2837246,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.18670","arxiv_id":"2511.18670","title":"Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers","abstract":"Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.","short_abstract":"Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (D...","url_abs":"https://arxiv.org/abs/2511.18670","url_pdf":"https://arxiv.org/pdf/2511.18670v1","authors":"[\"Rowan Bradbury\",\"Aniket Srinivasan Ashok\",\"Sai Ram Kasanagottu\",\"Gunmay Jhingran\",\"Shuai Meng\"]","published":"2025-11-24T00:55:14Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false}
