{"ID":2839695,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.15703","arxiv_id":"2511.15703","title":"Think Visually, Reason Textually: Vision-Language Synergy in ARC","abstract":"Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33\\% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.","short_abstract":"Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificia...","url_abs":"https://arxiv.org/abs/2511.15703","url_pdf":"https://arxiv.org/pdf/2511.15703v2","authors":"[\"Beichen Zhang\",\"Yuhang Zang\",\"Xiaoyi Dong\",\"Yuhang Cao\",\"Haodong Duan\",\"Dahua Lin\",\"Jiaqi Wang\"]","published":"2025-11-19T18:59:04Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":606897,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2839695,"paper_url":"https://arxiv.org/abs/2511.15703","paper_title":"Think Visually, Reason Textually: Vision-Language Synergy in ARC","repo_url":"https://github.com/InternLM/ARC-VL","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}