{"ID":2841731,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.11313","arxiv_id":"2511.11313","title":"DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding","abstract":"Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code and model are available at https://github.com/Tanveer81/DocSLM.git.","short_abstract":"Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-do...","url_abs":"https://arxiv.org/abs/2511.11313","url_pdf":"https://arxiv.org/pdf/2511.11313v3","authors":"[\"Tanveer Hannan\",\"Dimitrios Mallios\",\"Parth Pathak\",\"Faegheh Sardari\",\"Thomas Seidl\",\"Gedas Bertasius\",\"Mohsen Fayyaz\",\"Sunando Sengupta\"]","published":"2025-11-14T13:56:39Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":607078,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2841731,"paper_url":"https://arxiv.org/abs/2511.11313","paper_title":"DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding","repo_url":"https://github.com/Tanveer81/DocSLM.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
