{"ID":2834326,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.01302","arxiv_id":"2512.01302","title":"DCText: Scheduled Attention Masking for Visual Text Generation via Divide-and-Conquer Strategy","abstract":"Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-Modal Diffusion Transformers. Our method first decomposes a prompt by extracting and dividing the target text, then assigns each to a designated region. To accurately render each segment within their regions while preserving overall image coherence, we introduce two attention masks - Text-Focus and Context-Expansion - applied sequentially during denoising. Additionally, Localized Noise Initialization further improves text accuracy and region alignment without increasing computational cost. Extensive experiments on single- and multisentence benchmarks show that DCText achieves the best text accuracy without compromising image quality while also delivering the lowest generation latency.","short_abstract":"Despite recent text-to-image models achieving highfidelity text rendering, they still struggle with long or multiple texts due to diluted global attention. We propose DCText, a training-free visual text generation method that adopts a divide-and-conquer strategy, leveraging the reliable short-text generation of Multi-M...","url_abs":"https://arxiv.org/abs/2512.01302","url_pdf":"https://arxiv.org/pdf/2512.01302v2","authors":"[\"Jaewoo Song\",\"Jooyoung Choi\",\"Kanghyun Baek\",\"Sangyub Lee\",\"Daemin Park\",\"Sungroh Yoon\"]","published":"2025-12-01T05:52:55Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}