Where to Add PDE Diffusion in Transformers
Abstract
Transformers enable powerful content-based global routing via self-attention, but they lack an explicit local geometric prior along the sequence axis. As a result, the placement of locality-inducing modules in hybrid architectures has largely been empirical. We study a simple deterministic PDE diffusion layer implemented as one explicit Euler step of one-dimensional heat smoothing using a discrete Neumann Laplacian under a spectral stability constraint, and ask a structural question: where should diffusion be inserted relative to attention? Our central claim is that diffusion and attention generally do not commute, so inserting the same local operator before versus after attention leads to qualitatively different behaviors. We develop a three-layer operator-theoretic framework that (1) establishes unconditional guarantees for the diffusion subsystem, including spectral non-expansiveness and monotone Dirichlet-energy dissipation when the diffusion step size is smaller than one half, (2) derives compositional perturbation bounds linking insertion effects to representation roughness and downstream amplification, and (3) uses diffusion-attention non-commutativity as a diagnostic for structural double-mixing conflicts. Guided by theory, we evaluate seven insertion positions on the Long Range Arena benchmark. Early diffusion acts as effective pre-regularization, improving average accuracy by 4.1 percentage points when applied after embedding, while post-attention diffusion degrades performance by 2.5 percentage points, consistent with the predicted conflict. A multi-scale diffusion variant yields consistent gains under the same global stability constraint. Our analysis provides a general template for reasoning about local-global compositions in sequence models by separating provable guarantees, compositional bounds, and mechanistic diagnostics.
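As a minimal sketch of the diffusion layer described above, the following NumPy snippet applies one explicit Euler heat step along the sequence axis and numerically checks the three properties the abstract names: non-expansiveness, Dirichlet-energy dissipation, and non-commutativity with attention. The abstract does not spell out the update rule, so the form x ← x − αLx, with L the path-graph Laplacian with reflecting (Neumann) ends, is an assumption. The names `neumann_laplacian`, `diffuse`, `dirichlet_energy`, and `toy_attention` are hypothetical, and the attention function is a single-head stand-in with identity projections, not the model evaluated in the paper.

```python
import numpy as np

def neumann_laplacian(n):
    """Positive semidefinite 1-D Laplacian with reflecting (Neumann) ends.

    Assumed form: the path-graph Laplacian, whose eigenvalues lie in [0, 4).
    """
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L[0, 0] = 1.0    # boundary rows have a single neighbour
    L[-1, -1] = 1.0
    return L

def diffuse(x, alpha):
    """One explicit Euler heat step along the sequence axis: x - alpha * L x."""
    return x - alpha * (neumann_laplacian(x.shape[0]) @ x)

def dirichlet_energy(x):
    """Sum of squared differences between adjacent sequence positions."""
    d = np.diff(x, axis=0)
    return float(np.sum(d * d))

def toy_attention(x):
    """Hypothetical stand-in: single-head softmax self-attention, identity projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))   # (sequence length, feature dimension)
alpha = 0.25                   # step size below one half

y = diffuse(x, alpha)
norm_before, norm_after = np.linalg.norm(x), np.linalg.norm(y)
energy_before, energy_after = dirichlet_energy(x), dirichlet_energy(y)

# Diffusion before attention versus diffusion after attention.
pre = toy_attention(diffuse(x, alpha))
post = diffuse(toy_attention(x), alpha)
commutator_gap = float(np.linalg.norm(pre - post))

print(f"norm:   {norm_before:.4f} -> {norm_after:.4f}")
print(f"energy: {energy_before:.4f} -> {energy_after:.4f}")
print(f"pre/post insertion gap: {commutator_gap:.4f}")
```

Under these assumptions, the eigenvalues of I − αL lie in (1 − 4α, 1], so any α below one half keeps them within [−1, 1], which is what makes the step non-expansive and the Dirichlet energy non-increasing. The nonzero gap between `pre` and `post` only illustrates non-commutativity on random data; it does not by itself indicate which insertion position performs better.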