{"ID":2922233,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-03T00:47:32.987482086Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.00954","arxiv_id":"2606.00954","title":"COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation","abstract":"Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via the Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.","short_abstract":"Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. \nCurrent object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over object...","url_abs":"https://arxiv.org/abs/2606.00954","url_pdf":"https://arxiv.org/pdf/2606.00954v1","authors":"[\"Xinlong Zhang\",\"Jia Wei\",\"Xiaoyu Zhang\",\"Teng Zhou\",\"Chengyu Lin\",\"Yongchuan Tang\"]","published":"2026-05-31T02:10:34Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\"]","has_code":false}