{"ID":2877677,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.21094","arxiv_id":"2508.21094","title":"EMCompress: Video-LLMs with Endomorphic Multimodal Compression","abstract":"Video-LLMs face a fundamental tension in long-video reasoning: static, sparse frame sampling either dilutes evidence across task-irrelevant segments at significant cost or misses fine-grained temporal semantics altogether. We propose a novel, cognitively-inspired task -- Endomorphic Multimodal Compression (EMC) -- as a structurally-constrained sufficient-statistic problem for VideoQA, and formulate it as an endomorphic transformation F_EMC : (V, Q) -\u003e (v, q) that compresses the multimodal input while preserving answer invariance across reasonable downstream models. The endomorphic form keeps the compressed output in the downstream pipeline's native task space -- a structural mirror of the filter-then-reason mechanism in the cognitive literature motivating EMC -- distinguishing it from latent-code compression (IB / VIB) and making the formulation extensible to other multimodal settings. Under the Markov chain A -\u003e (V, Q) -\u003e (v, q), EMC realizes the classical sufficiency condition I((v, q); A) = I((V, Q); A) in its VideoQA-natural form. As a modular front-end, EMC plugs into both Video Instruction Tuning and Video Question Answering pipelines. We release the first dedicated benchmark and propose ReSimplifyIt, an EMC baseline surpassing prior methods by 0.40 F-1 with competitive query rewriting. Integrating EMC yields relative gains of 7.33% in training and 33.7% in inference for video-language understanding.","short_abstract":"Video-LLMs face a fundamental tension in long-video reasoning: static, sparse frame sampling either dilutes evidence across task-irrelevant segments at significant cost or misses fine-grained temporal semantics altogether. We propose a novel, cognitively-inspired task -- Endomorphic Multimodal Compression (EMC) -- as a...","url_abs":"https://arxiv.org/abs/2508.21094","url_pdf":"https://arxiv.org/pdf/2508.21094v3","authors":"[\"Zheyu Fan\",\"Jiateng Liu\",\"Yuji Zhang\",\"Zihan Wang\",\"Yi R. Fung\",\"Manling Li\",\"Heng Ji\"]","published":"2025-08-27T14:33:32Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}
