{"ID":2843363,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.08173","arxiv_id":"2511.08173","title":"VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion","abstract":"Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.","short_abstract":"Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. 
Speci...","url_abs":"https://arxiv.org/abs/2511.08173","url_pdf":"https://arxiv.org/pdf/2511.08173v1","authors":"[\"Samet Hicsonmez\",\"Abd El Rahman Shabayek\",\"Djamila Aouada\"]","published":"2025-11-11T12:37:38Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607203,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2843363,"paper_url":"https://arxiv.org/abs/2511.08173","paper_title":"VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion","repo_url":"https://github.com/giddyyupp/VLMDiff","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
