BEV-VLM: Trajectory Planning via Unified BEV Abstraction

cs.RO arXiv:2509.25249

Abstract

This paper introduces BEV-VLM, an approach to trajectory planning for autonomous driving in which a Vision-Language Model (VLM) takes Bird's-Eye View (BEV) feature maps as its visual input. Unlike conventional trajectory planning approaches that rely solely on raw visual data (e.g., camera images), our method uses a compact, informative BEV representation generated by fusing camera and LiDAR data and then aligning the result with High-Definition (HD) maps. This unified BEV-HD map format provides a geometrically consistent and semantically rich scene description, which enables VLMs to perform accurate and robust trajectory planning. Experimental results on the nuScenes dataset show that, compared with state-of-the-art vision-only methods, our approach improves planning accuracy by 53.1% and avoids all collisions in the evaluation scenarios. Our work shows that VLMs can effectively interpret processed visual representations such as BEV features, which extends their applicability in trajectory planning beyond raw image inputs.
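
The abstract does not say how the BEV features are aligned with the HD map. As a rough illustration of what "geometrically consistent" can mean in practice, the sketch below projects world-frame map points into an ego-centred BEV grid, so that map elements and sensor-derived features share one coordinate frame. Everything here is an assumption made for illustration, not the paper's method: the function names world_to_bev_cell and rasterize_map_points, the 50 m grid range, the 0.5 m cell size, and the row/column convention are all hypothetical.

```python
import math


def world_to_bev_cell(x_w, y_w, ego_x, ego_y, ego_yaw,
                      grid_range_m=50.0, cell_m=0.5):
    """Map a world-frame point to a (row, col) cell of an ego-centred BEV grid.

    Hypothetical convention: ego frame has x forward and y left; row 0 is the
    farthest cell ahead and col 0 is the farthest cell to the left.
    Returns None when the point falls outside the grid.
    """
    # Translate to the ego origin, then rotate by -yaw into the ego frame.
    dx, dy = x_w - ego_x, y_w - ego_y
    c, s = math.cos(-ego_yaw), math.sin(-ego_yaw)
    x_e = c * dx - s * dy
    y_e = s * dx + c * dy
    if abs(x_e) >= grid_range_m or abs(y_e) >= grid_range_m:
        return None
    row = int((grid_range_m - x_e) / cell_m)
    col = int((grid_range_m - y_e) / cell_m)
    return row, col


def rasterize_map_points(points_w, ego_pose, grid_range_m=50.0, cell_m=0.5):
    """Mark every grid cell that contains at least one HD-map point.

    points_w is an iterable of (x, y) world coordinates (for example, points
    sampled along a lane boundary); ego_pose is (x, y, yaw_in_radians).
    """
    size = int(2 * grid_range_m / cell_m)
    mask = [[0] * size for _ in range(size)]
    ego_x, ego_y, ego_yaw = ego_pose
    for x_w, y_w in points_w:
        cell = world_to_bev_cell(x_w, y_w, ego_x, ego_y, ego_yaw,
                                 grid_range_m, cell_m)
        if cell is not None:
            r, col = cell
            mask[r][col] = 1
    return mask


if __name__ == "__main__":
    # A short, made-up lane boundary 2 m to the right of the ego vehicle.
    lane = [(10.0 + d, 18.0) for d in range(0, 20, 2)]
    mask = rasterize_map_points(lane, ego_pose=(10.0, 20.0, 0.0))
    print(sum(map(sum, mask)), "cells marked")
```

A binary mask like this could be stacked with fused camera and LiDAR feature channels on the same grid; how BEV-VLM actually combines the two is described in the paper itself, not in this sketch.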
