{"ID":2849516,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.23095","arxiv_id":"2510.23095","title":"Revisiting Multimodal Positional Encoding in Vision-Language Models","abstract":"Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, ensuring unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.","short_abstract":"Multimodal position encoding is essential for vision-language models, yet there has been little systematic investigation into multimodal position encoding. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation....","url_abs":"https://arxiv.org/abs/2510.23095","url_pdf":"https://arxiv.org/pdf/2510.23095v3","authors":"[\"Jie Huang\",\"Xuejing Liu\",\"Sibo Song\",\"Ruibing Hou\",\"Hong Chang\",\"Junyang Lin\",\"Shuai Bai\"]","published":"2025-10-27T08:00:46Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false,"code_links":[{"ID":607710,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2849516,"paper_url":"https://arxiv.org/abs/2510.23095","paper_title":"Revisiting Multimodal Positional Encoding in Vision-Language Models","repo_url":"https://github.com/JJJYmmm/Multimodal-RoPEs","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
