{"ID":2831223,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.08648","arxiv_id":"2512.08648","title":"Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank","abstract":"The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limitations: the reliance on external, pre-trained encoders introduces overhead and domain shift. A dispersion-based strategy that encourages strong separation among in-batch latent representations alleviates this specific dependency. To assess the effect of the number of negative samples in generative modeling, we propose Repulsor, a plug-and-play training framework that requires no external encoders. Our method integrates a memory bank mechanism that maintains a large, dynamically updated queue of negative samples across training iterations. This decouples the number of negatives from the mini-batch size, providing abundant and high-quality negatives for a contrastive objective without a multiplicative increase in computational cost. A low-dimensional projection head is used to further minimize memory and bandwidth overhead. Repulsor offers three principal advantages: (1) it is self-contained, eliminating dependency on pretrained vision foundation models and their associated forward-pass overhead; (2) it introduces no additional parameters or computational cost during inference; and (3) it enables substantially faster convergence, achieving superior generative quality more efficiently. 
On ImageNet-256, Repulsor achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.","short_abstract":"The dominance of denoising generative models (e.g., diffusion, flow-matching) in visual synthesis is tempered by their substantial training costs and inefficiencies in representation learning. While injecting discriminative representations via auxiliary alignment has proven effective, this approach still faces key limi...","url_abs":"https://arxiv.org/abs/2512.08648","url_pdf":"https://arxiv.org/pdf/2512.08648v2","authors":"[\"Shaofeng Zhang\",\"Xuanqi Chen\",\"Ning Liao\",\"Haoxiang Zhao\",\"Xiaoxing Wang\",\"Haoru Tan\",\"Sitong Wu\",\"Xiaosong Jia\",\"Qi Fan\",\"Junchi Yan\"]","published":"2025-12-09T14:39:26Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false}
