{"ID":2877934,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.18722","arxiv_id":"2508.18722","title":"VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft","abstract":"Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.","short_abstract":"Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. Th...","url_abs":"https://arxiv.org/abs/2508.18722","url_pdf":"https://arxiv.org/pdf/2508.18722v3","authors":"[\"Honghao Fu\",\"Junlong Ren\",\"Qi Chai\",\"Deheng Ye\",\"Yujun Cai\",\"Hao Wang\"]","published":"2025-08-26T06:39:15Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
