{"ID":2846972,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.00833","arxiv_id":"2511.00833","title":"Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials","abstract":"Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n \u003c\u003c N. VCA first distils each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers.\nThe source code is available at https://github.com/LeapLabTHU/LinearDiff.","short_abstract":"Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce V...","url_abs":"https://arxiv.org/abs/2511.00833","url_pdf":"https://arxiv.org/pdf/2511.00833v1","authors":"[\"Yifan Pu\",\"Jixuan Ying\",\"Qixiu Li\",\"Tianzhu Ye\",\"Dongchen Han\",\"Xiaochen Wang\",\"Ziyi Wang\",\"Xinyu Shao\",\"Gao Huang\",\"Xiu Li\"]","published":"2025-11-02T07:04:12Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\"]","methods":"[\"Vision Transformer\",\"Diffusion Model\",\"Transformer\"]","has_code":false,"code_links":[{"ID":607476,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2846972,"paper_url":"https://arxiv.org/abs/2511.00833","paper_title":"Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials","repo_url":"https://github.com/LeapLabTHU/LinearDiff","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
