{"ID":2833632,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.04032","arxiv_id":"2512.04032","title":"jina-vlm: Small Multilingual Vision Language Model","abstract":"We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of arbitrary-resolution images. To understand the contribution of different training data categories, we conduct a leave-one-out data mixture ablation study (systematically removing task, domain, modality, and language categories) to diagnose which data types are necessary versus redundant and whether task benefits transfer across domains. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.","short_abstract":"We present jina-vlm, a token-efficient 2.4B parameter vision-language model that achieves state-of-the-art multilingual VQA performance among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language decoder and makes use of image tiling and attention-pooling for token-efficient processing of...","url_abs":"https://arxiv.org/abs/2512.04032","url_pdf":"https://arxiv.org/pdf/2512.04032v3","authors":"[\"Andreas Koukounas\",\"Georgios Mastrapas\",\"Florian Hönicke\",\"Sedigheh Eslami\",\"Guillaume Roncari\",\"Scott Martens\",\"Han Xiao\"]","published":"2025-12-03T18:13:41Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\",\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false}