{"ID":2863914,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.03295","arxiv_id":"2510.03295","title":"Multimodal Arabic Captioning with Interpretable Visual Concept Integration","abstract":"We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and Jina V4, each evaluated separately for label retrieval. A hybrid vocabulary is built from training captions and enriched with about 21K general domain labels translated from the Visual Genome dataset, covering objects, attributes, and scenes. The top-k retrieved labels are transformed into fluent Arabic prompts and passed along with the original image to vision-language models. In the second stage, we tested Qwen-VL and Gemini Pro Vision for caption generation, resulting in six encoder-decoder configurations. The results show that mCLIP + Gemini Pro Vision achieved the best BLEU-1 (5.34%) and cosine similarity (60.01%), while AraCLIP + Qwen-VL obtained the highest LLM-judge score (36.33%). This interpretable pipeline enables culturally coherent and contextually accurate Arabic captions.","short_abstract":"We present VLCAP, an Arabic image captioning framework that integrates CLIP-based visual label retrieval with multimodal text generation. Rather than relying solely on end-to-end captioning, VLCAP grounds generation in interpretable Arabic visual concepts extracted with three multilingual encoders, mCLIP, AraCLIP, and...","url_abs":"https://arxiv.org/abs/2510.03295","url_pdf":"https://arxiv.org/pdf/2510.03295v1","authors":"[\"Passant Elchafei\",\"Amany Fashwan\"]","published":"2025-09-29T18:52:38Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
