{"ID":2895464,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.09338","arxiv_id":"2507.09338","title":"Simplifying Traffic Anomaly Detection with Video Foundation Models","abstract":"Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple yet flexible architectures to outperform specialized designs. Therefore, in this work, we investigate an architecturally simple encoder-only approach using plain Video Vision Transformers (Video ViTs) and study how pre-training enables strong TAD performance. We find that: (i) advanced pre-training enables simple encoder-only models to match or even surpass the performance of specialized state-of-the-art TAD methods, while also being significantly more efficient; (ii) although weakly- and fully-supervised pre-training are advantageous on standard benchmarks, we find them less effective for TAD. Instead, self-supervised Masked Video Modeling (MVM) provides the strongest signal; and (iii) Domain-Adaptive Pre-Training (DAPT) on unlabeled driving videos further improves downstream performance, without requiring anomalous examples. Our findings highlight the importance of pre-training and show that effective, efficient, and scalable TAD models can be built with minimal architectural complexity. We release our code, domain-adapted encoders, and fine-tuned models to support future work: https://github.com/tue-mps/simple-tad.","short_abstract":"Recent methods for ego-centric Traffic Anomaly Detection (TAD) often rely on complex multi-stage or multi-representation fusion architectures, yet it remains unclear whether such complexity is necessary. Recent findings in visual perception suggest that foundation models, enabled by advanced pre-training, allow simple...","url_abs":"https://arxiv.org/abs/2507.09338","url_pdf":"https://arxiv.org/pdf/2507.09338v2","authors":"[\"Svetlana Orlova\",\"Tommie Kerssies\",\"Brunó B. Englert\",\"Gijs Dubbelman\"]","published":"2025-07-12T16:36:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false,"code_links":[{"ID":612199,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2895464,"paper_url":"https://arxiv.org/abs/2507.09338","paper_title":"Simplifying Traffic Anomaly Detection with Video Foundation Models","repo_url":"https://github.com/tue-mps/simple-tad","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}