{"ID":2921148,"CreatedAt":"2026-06-02T02:42:49.606572591Z","UpdatedAt":"2026-06-04T06:21:04.369492701Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.01757","arxiv_id":"2606.01757","title":"PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection","abstract":"Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.","short_abstract":"Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradig...","url_abs":"https://arxiv.org/abs/2606.01757","url_pdf":"https://arxiv.org/pdf/2606.01757v1","authors":"[\"Smit Kadvani\",\"Shriya Gumber\",\"Kriti Faujdar\",\"Harsh Dave\"]","published":"2026-06-01T06:22:47Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Transformer\"]","has_code":false}
