{"ID":2830662,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2512.09510","arxiv_id":"2512.09510","title":"ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics","abstract":"Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual-Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.","short_abstract":"Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Hea...","url_abs":"https://arxiv.org/abs/2512.09510","url_pdf":"https://arxiv.org/pdf/2512.09510v1","authors":"[\"Donato Caramia\",\"Florian T. Pokorny\",\"Giuseppe Triggiani\",\"Denis Ruffino\",\"David Naso\",\"Paolo Roberto Massenio\"]","published":"2025-12-10T10:34:43Z","proceeding":"cs.RO","tasks":"[\"cs.RO\",\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false}
