{"ID":2855577,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.12190","arxiv_id":"2510.12190","title":"Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos","abstract":"Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.","short_abstract":"Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2C...","url_abs":"https://arxiv.org/abs/2510.12190","url_pdf":"https://arxiv.org/pdf/2510.12190v1","authors":"[\"Shingo Yokoi\",\"Kento Sasaki\",\"Yu Yamaguchi\"]","published":"2025-10-14T06:36:41Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Language Model\"]","has_code":false,"code_links":[{"ID":608264,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2855577,"paper_url":"https://arxiv.org/abs/2510.12190","paper_title":"Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos","repo_url":"https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
