{"ID":2839426,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.15197","arxiv_id":"2511.15197","title":"Insert In Style: A Zero-Shot Generative Framework for Harmonious Cross-Domain Object Composition","abstract":"Reference-based object composition involves integrating a foreground reference image with a background scene to produce a harmonious fused image. This task becomes particularly challenging in cross-domain scenarios, where models must balance preserving the reference object's identity with harmonizing it to match stylized environments. This under-explored problem is currently split between practical \"blenders\" that lack generative fidelity and \"generators\" that require impractical, per-subject online finetuning. In this work, we introduce Insert In Style, the first zero-shot generative framework that is both practical and high-fidelity. Our core contribution is a unified framework built on (i) a novel multi-stage training protocol that disentangles representations for identity, style, and composition, (ii) a specialized masked-attention architecture that surgically enforces this disentanglement during generation, and (iii) a prior preservation objective that keeps learned identity and style priors intact. By design, this approach mitigates the concept interference typical of unified-attention architectures while ensuring robust generalization across diverse references and styles. Our framework is trained on a new 115k-sample dataset, curated with a novel data pipeline. This pipeline couples large-scale generation with a rigorous, iterative human-in-the-loop filtering process to ensure both high-fidelity semantic identity and style coherence. Unlike prior work, our model is truly zero-shot and requires no text prompts. We also introduce a new public benchmark for stylized composition. 
We demonstrate state-of-the-art performance, significantly outperforming existing methods on both identity and style metrics, a result strongly corroborated by user studies.","short_abstract":"Reference-based object composition involves integrating a foreground reference image with a background scene to produce a harmonious fused image. This task becomes particularly challenging in cross-domain scenarios, where models must balance preserving the reference object's identity with harmonizing it to match stylized...","url_abs":"https://arxiv.org/abs/2511.15197","url_pdf":"https://arxiv.org/pdf/2511.15197v2","authors":"[\"Raghu Vamsi Chittersu\",\"Yuvraj Singh Rathore\",\"Pranav Adlinge\",\"Kunal Swami\"]","published":"2025-11-19T07:33:00Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[]","has_code":false}