{"ID":2852932,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.17800","arxiv_id":"2510.17800","title":"Glyph: Scaling Context Windows via Visual-Text Compression","abstract":"Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.","short_abstract":"Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. 
However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In thi...","url_abs":"https://arxiv.org/abs/2510.17800","url_pdf":"https://arxiv.org/pdf/2510.17800v2","authors":"[\"Jiale Cheng\",\"Yusen Liu\",\"Xinyu Zhang\",\"Yulin Fei\",\"Wenyi Hong\",\"Ruiliang Lyu\",\"Weihan Wang\",\"Zhe Su\",\"Xiaotao Gu\",\"Xiao Liu\",\"Yushi Bai\",\"Jie Tang\",\"Hongning Wang\",\"Minlie Huang\"]","published":"2025-10-20T17:58:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.CL\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":608045,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2852932,"paper_url":"https://arxiv.org/abs/2510.17800","paper_title":"Glyph: Scaling Context Windows via Visual-Text Compression","repo_url":"https://github.com/thu-coai/Glyph","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
