{"ID":2900862,"CreatedAt":"2026-06-01T05:51:17.9442275Z","UpdatedAt":"2026-06-01T06:23:29.641557848Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2605.30917","arxiv_id":"2605.30917","title":"Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search","abstract":"As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. \nCode will be released soon at https://github.com/naver/v-splade.","short_abstract":"As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve s...","url_abs":"https://arxiv.org/abs/2605.30917","url_pdf":"https://arxiv.org/pdf/2605.30917v1","authors":"[\"Gyu-Hwung Cho\",\"Youngjune Lee\",\"Kiyoon Jeong\",\"Siyoung Lee\",\"Sanggyu Han\",\"Hervé Dejean\",\"Stéphane Clinchant\",\"Seung-won Hwang\"]","published":"2026-05-29T07:01:45Z","proceeding":"cs.IR","tasks":"[\"cs.IR\",\"cs.CV\"]","methods":"[]","has_code":false,"code_links":[{"ID":612533,"CreatedAt":"2026-06-01T05:51:17.9442275Z","UpdatedAt":"2026-06-01T05:51:17.9442275Z","DeletedAt":null,"paper_id":2900862,"paper_url":"https://arxiv.org/abs/2605.30917","paper_title":"Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search","repo_url":"https://github.com/naver/v-splade","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}