{"ID":2890559,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.19370","arxiv_id":"2507.19370","title":"BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving","abstract":"Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5\\% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.","short_abstract":"Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparen...","url_abs":"https://arxiv.org/abs/2507.19370","url_pdf":"https://arxiv.org/pdf/2507.19370v1","authors":"[\"Felix Brandstaetter\",\"Erik Schuetz\",\"Katharina Winter\",\"Fabian Flohr\"]","published":"2025-07-25T15:22:56Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Large Language Model\"]","has_code":false}
