{"ID":2852033,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.18279","arxiv_id":"2510.18279","title":"Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs","abstract":"Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.","short_abstract":"Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are...","url_abs":"https://arxiv.org/abs/2510.18279","url_pdf":"https://arxiv.org/pdf/2510.18279v2","authors":"[\"Yanhong Li\",\"Zixuan Lan\",\"Jiawei Zhou\"]","published":"2025-10-21T04:07:20Z","proceeding":"cs.CL","tasks":"[\"cs.CL\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\",\"Convolutional Neural Network\"]","has_code":false}
