{"ID":2899626,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.00707","arxiv_id":"2507.00707","title":"BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving","abstract":"Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder for a compact and unified BEV latent space and then generates the scene with a latent diffusion transformer. BEV-VAE supports arbitrary view generation given camera configurations, and optionally 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.","short_abstract":"Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image set generation task, lacking explicit 3D modeling. However, we argue that a structured representation is crucial for scene generation, especially for au...","url_abs":"https://arxiv.org/abs/2507.00707","url_pdf":"https://arxiv.org/pdf/2507.00707v1","authors":"[\"Zeming Chen\",\"Hang Zhao\"]","published":"2025-07-01T12:10:11Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Diffusion Model\",\"Transformer\",\"Variational Autoencoder\"]","has_code":false,"code_links":[{"ID":612503,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2899626,"paper_url":"https://arxiv.org/abs/2507.00707","paper_title":"BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving","repo_url":"https://github.com/Czm369/bev-vae","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}