{"ID":2858367,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.08567","arxiv_id":"2510.08567","title":"MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning","abstract":"Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.","short_abstract":"Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centr...","url_abs":"https://arxiv.org/abs/2510.08567","url_pdf":"https://arxiv.org/pdf/2510.08567v3","authors":"[\"Tajamul Ashraf\",\"Umair Nawaz\",\"Abdelrahman M. Shaker\",\"Rao Anwer\",\"Philip Torr\",\"Fahad Shahbaz Khan\",\"Salman Khan\"]","published":"2025-10-09T17:59:54Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.AI\",\"cs.CL\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608540,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2858367,"paper_url":"https://arxiv.org/abs/2510.08567","paper_title":"MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning","repo_url":"https://github.com/mbzuai-oryx/MATRIX","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}