{"ID":2869536,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.15357","arxiv_id":"2509.15357","title":"MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation","abstract":"Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. On the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion models improve global interactions but at substantially higher compute and memory cost. In parallel, compositional reliability remains weak: models often mix attributes across objects, violate spatial relations, or omit requested entities, and these errors are not reliably reflected by global metrics such as FID or CLIP-based scores. To address these issues without changing the SDXL pipeline, we propose MaskAttn-SDXL, a plug-in module that injects token-conditioned spatial gating into cross-attention logits before softmax. The gating sparsifies token-to-location interactions to suppress irrelevant bindings while preserving the pretrained backbone and standard sampling process, requiring no external supervision or inference-time editing.","short_abstract":"Diffusion models have achieved strong results in text-to-image generation, but important limitations remain as prompts become more structured and multi-object. \nOn the architecture side, U-Net backbones are efficient and stable, yet their locality makes global coordination harder, while Transformer-based diffusion model...","url_abs":"https://arxiv.org/abs/2509.15357","url_pdf":"https://arxiv.org/pdf/2509.15357v2","authors":"[\"Yu Chang\",\"Jiahao Chen\",\"Anzhe Cheng\",\"Paul Bogdan\"]","published":"2025-09-18T18:57:47Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}