{"ID":2833899,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.02536","arxiv_id":"2512.02536","title":"WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens","abstract":"Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.","short_abstract":"Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, f...","url_abs":"https://arxiv.org/abs/2512.02536","url_pdf":"https://arxiv.org/pdf/2512.02536v1","authors":"[\"Jian Yang\",\"Dacheng Yin\",\"Xiaoxuan He\",\"Yong Li\",\"Fengyun Rao\",\"Jing Lyu\",\"Wei Zhai\",\"Yang Cao\",\"Zheng-Jun Zha\"]","published":"2025-12-02T09:02:20Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Large Language Model\",\"Language Model\",\"Variational Autoencoder\"]","has_code":false}