{"ID":2829181,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.13898","arxiv_id":"2512.13898","title":"Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs","abstract":"Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.","short_abstract":"Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale perfor...","url_abs":"https://arxiv.org/abs/2512.13898","url_pdf":"https://arxiv.org/pdf/2512.13898v1","authors":"[\"Rachit Bansal\",\"Aston Zhang\",\"Rishabh Tiwari\",\"Lovish Madaan\",\"Sai Surya Duvvuri\",\"Devvrit Khatri\",\"David Brandfonbrener\",\"David Alvarez-Melis\",\"Prajjwal Bhargava\",\"Mihir Sanjay Kale\",\"Samy Jelassi\"]","published":"2025-12-15T21:01:37Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false}