{"ID":2849772,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23574","arxiv_id":"2510.23574","title":"More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models","abstract":"Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation, but also expand to depth estimation effortlessly. Specifically, MERGE introduces a play-and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE","short_abstract":"Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. \nWe introduce...","url_abs":"https://arxiv.org/abs/2510.23574","url_pdf":"https://arxiv.org/pdf/2510.23574v1","authors":"[\"Hongkai Lin\",\"Dingkang Liang\",\"Mingyang Du\",\"Xin Zhou\",\"Xiang Bai\"]","published":"2025-10-27T17:44:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\"]","has_code":false,"code_links":[{"ID":607742,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2849772,"paper_url":"https://arxiv.org/abs/2510.23574","paper_title":"More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models","repo_url":"https://github.com/H-EmbodVis/MERGE","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
