{"ID":2848686,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25713","arxiv_id":"2510.25713","title":"Robotic Assistant: Completing Collaborative Tasks with Dexterous Vision-Language-Action Models","abstract":"We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (iii) action-space post-processing that predicts compact deltas (position/rotation) and PCA-reduced finger joints before mapping to full commands. Using a multi-view, teleoperated Franka and Mimic-hand dataset augmented with MediaPipe hand poses, we demonstrate that delta actions are well-behaved and that four principal components explain ~96% of hand-joint variance. Ablations identify action post-processing as the primary performance driver; auxiliary intent helps, FiLM is mixed, and a directional motion loss is detrimental. A real-time stack (~0.3 s latency on one RTX 4090) composes \"pick-up\" and \"pass\" into a long-horizon behavior. We surface \"trainer overfitting\" to specific demonstrators as the key limitation.","short_abstract":"We adapt a pre-trained Vision-Language-Action (VLA) model (Open-VLA) for dexterous human-robot collaboration with minimal language prompting. Our approach adds (i) FiLM conditioning to visual backbones for task-aware perception, (ii) an auxiliary intent head that predicts collaborator hand pose and target cues, and (ii...","url_abs":"https://arxiv.org/abs/2510.25713","url_pdf":"https://arxiv.org/pdf/2510.25713v1","authors":"[\"Boshi An\",\"Chenyu Yang\",\"Robert Katzschmann\"]","published":"2025-10-29T17:22:59Z","proceeding":"cs.RO","tasks":"[\"cs.RO\"]","methods":"[]","has_code":false}
