Post-Norm can Resharpen Attention

cs.LG arXiv:2510.08341
View PDF arXiv JSON

Abstract

Length Generalization is the essential capacity of autonomous agents to perform tasks in longer contexts than those encountered during training. To systematically study this feat, we test how well models can approximate the next token distributions in algorithmic tasks. This is to take into account the realistic possibility of multiple next tokens being legal. We present a prototypical benchmark for this line of study: in the Set Complement Task, the model needs to output a uniform distribution over tokens not in the input. We prove a theorem that states simple transformers can length generalize on this task, however, with performance degradation due to attention dispersion. A mechanistic reading of how dispersion takes effect lets us discover a remedy: Post-Norm can Resharpen Attention. We present experimental evidence to support this idea. We also show that Exponential Moving Averages can help the issue of noisy gradients that arises when many next tokens are legal. We validate the general applicability of our proposed methods on a suite of formal language experiments. Our source code will be available upon publication.

PDF Viewer