{"ID":2866804,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.20579","arxiv_id":"2509.20579","title":"Large Pre-Trained Models for Bimanual Manipulation in 3D","abstract":"We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.","short_abstract":"We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are...","url_abs":"https://arxiv.org/abs/2509.20579","url_pdf":"https://arxiv.org/pdf/2509.20579v1","authors":"[\"Hanna Yurchyk\",\"Wei-Di Chang\",\"Gregory Dudek\",\"David Meger\"]","published":"2025-09-24T21:38:42Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.RO\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}