{"ID":3084691,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T20:54:36.964885582Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05445","arxiv_id":"2606.05445","title":"Brick-Composer: Using MLLMs for Assembly with Diverse Bricks","abstract":"We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate brick assembly as a sequential decision-making problem, where each step involves two subtasks: brick selection, identifying the target brick from candidate components, and brick pose estimation, predicting where and how the selected brick should be placed. To support this study, we introduce BC-Bench (Brick Construction Benchmark), the first benchmark for evaluating MLLMs on assembly with diverse bricks. Experiments show that current state-of-the-art MLLMs remain far from reliable builders, struggling with fine-grained brick selection and failing at precise pose estimation. To bridge this gap, we propose Brick-Composer, a learning framework that equips MLLMs with assembly skills through three complementary signals: Human Design Sparks, which provide affordance-rich construction demonstrations; World Feedback, which grounds predicted actions in visual and physical consequences; and Synthetic Experience, which scales learning beyond existing object designs. Brick-Composer improves brick selection accuracy by over three times, substantially reduces pose estimation errors, and raises strict step-level assembly success from less than 1% to around 15%. After training, a Qwen-3-8B can correctly compose up to 42% of the steps for a complete object, suggesting that MLLMs can acquire assembly capabilities through targeted, physically grounded learning.","short_abstract":"We dream of AI agents that can read arbitrary designs and construct real-world objects from reusable building blocks. As a first step toward this vision, we study whether multimodal large language models (MLLMs) possess the visual grounding and spatial reasoning capabilities required for brick assembly. We formulate br...","url_abs":"https://arxiv.org/abs/2606.05445","url_pdf":"https://arxiv.org/pdf/2606.05445v1","authors":"[\"Jiateng Liu\",\"Bingxuan Li\",\"Zhenhailong Wang\",\"Rushi Wang\",\"Kaiwen Hong\",\"Cheng Qian\",\"Jiayu Liu\",\"Denghui Zhang\",\"Katherine Driggs-Campbell\",\"Manling Li\",\"Heng Ji\"]","published":"2026-06-03T21:08:06Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}