{"ID":2846655,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.01472","arxiv_id":"2511.01472","title":"AERMANI-VLM: Structured Prompting and Reasoning for Aerial Manipulation with Vision Language Models","abstract":"The rapid progress of vision--language models (VLMs) has sparked growing interest in robotic control, where natural language can express the operation goals while visual feedback links perception to action. However, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable since the generated actions are often inconsistent, hallucination-prone, and dynamically infeasible for flight. In this work, we present AERMANI-VLM, the first framework to adapt pretrained VLMs for aerial manipulation by separating high-level reasoning from low-level control, without any task-specific fine-tuning. Our framework encodes natural language instructions, task context, and safety constraints into a structured prompt that guides the model to generate a step-by-step reasoning trace in natural language. This reasoning output is used to select from a predefined library of discrete, flight-safe skills, ensuring interpretable and temporally consistent execution. By decoupling symbolic reasoning from physical action, AERMANI-VLM mitigates hallucinated commands and prevents unsafe behavior, enabling robust task completion. We validate the framework in both simulation and hardware on diverse multi-step pick-and-place tasks, demonstrating strong generalization to previously unseen commands, objects, and environments.","short_abstract":"The rapid progress of vision--language models (VLMs) has sparked growing interest in robotic control, where natural language can express the operation goals while visual feedback links perception to action. \nHowever, directly deploying VLM-driven policies on aerial manipulators remains unsafe and unreliable since the ge...","url_abs":"https://arxiv.org/abs/2511.01472","url_pdf":"https://arxiv.org/pdf/2511.01472v1","authors":"[\"Sarthak Mishra\",\"Rishabh Dev Yadav\",\"Avirup Das\",\"Saksham Gupta\",\"Wei Pan\",\"Spandan Roy\"]","published":"2025-11-03T11:31:55Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[\"Language Model\"]","has_code":false}
