{"ID":2851682,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.19496","arxiv_id":"2510.19496","title":"CARES: Context-Aware Resolution Selector for VLMs","abstract":"Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \\emph{CARES}-a \\textbf{C}ontext-\\textbf{A}ware \\textbf{R}esolution \\textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \\emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.","short_abstract":"Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \\emph{CARES}-a \\textbf{C}ontext-\\text...","url_abs":"https://arxiv.org/abs/2510.19496","url_pdf":"https://arxiv.org/pdf/2510.19496v2","authors":"[\"Moshe Kimhi\",\"Nimrod Shabtay\",\"Raja Giryes\",\"Chaim Baskin\",\"Eli Schwartz\"]","published":"2025-10-22T11:44:31Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
