{"ID":2824463,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.23675","arxiv_id":"2512.23675","title":"End-to-End Test-Time Training for Long Context","abstract":"We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.","short_abstract":"We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, c...","url_abs":"https://arxiv.org/abs/2512.23675","url_pdf":"https://arxiv.org/pdf/2512.23675v2","authors":"[\"Arnuv Tandon\",\"Karan Dalal\",\"Xinhao Li\",\"Daniel Koceja\",\"Marcel Rød\",\"Sam Buchanan\",\"Xiaolong Wang\",\"Jure Leskovec\",\"Sanmi Koyejo\",\"Tatsunori Hashimoto\",\"Carlos Guestrin\",\"Jed McCaleb\",\"Yejin Choi\",\"Yu Sun\"]","published":"2025-12-29T18:30:14Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Language Model\"]","has_code":false}
