{"ID":2848386,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.25166","arxiv_id":"2510.25166","title":"A Study on Inference Latency for Vision Transformers on Mobile Devices","abstract":"Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neural networks (CNNs), we provide insights into the factors that influence the latency of ViT architectures on mobile devices. Based on these insights, we develop a dataset including measured latencies of 1000 synthetic ViTs with representative building blocks and state-of-the-art architectures from two machine learning frameworks and six mobile platforms. Using this dataset, we show that inference latency of new ViTs can be predicted with sufficient accuracy for real-world applications.","short_abstract":"Given the significant advances in machine learning techniques on mobile devices, particularly in the domain of computer vision, in this work we quantitatively study the performance characteristics of 190 real-world vision transformers (ViTs) on mobile devices. Through a comparison with 102 real-world convolutional neur...","url_abs":"https://arxiv.org/abs/2510.25166","url_pdf":"https://arxiv.org/pdf/2510.25166v2","authors":"[\"Zhuojin Li\",\"Marco Paolieri\",\"Leana Golubchik\"]","published":"2025-10-29T04:57:49Z","proceeding":"cs.CV","tasks":"[\"cs.CV\",\"cs.LG\",\"cs.PF\"]","methods":"[\"Vision Transformer\",\"Transformer\",\"Convolutional Neural Network\"]","has_code":false}
