{"ID":3083635,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-07T03:21:39.539466367Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.06322","arxiv_id":"2606.06322","title":"DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions","abstract":"GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.","short_abstract":"GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data re...","url_abs":"https://arxiv.org/abs/2606.06322","url_pdf":"https://arxiv.org/pdf/2606.06322v1","authors":"[\"Nathan Bout\",\"Maxime Langevin\",\"Ronan Riochet\"]","published":"2026-06-04T15:57:29Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[]","has_code":false}