{"ID":3006064,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.02836","arxiv_id":"2606.02836","title":"Fast Transformer Inference on ARM-Based HMPSoCs","abstract":"Transformer models have set new performance standards for machine learning (ML) tasks. However, their resource-intensive deployment on resource-constrained edge devices for cloud-free, on-chip transformer inference remains challenging. The ARM Compute Library (ARM-CL) framework provides low-latency CNN inference on ARM-based edge devices but lacks support for transformer inference. In this work, we implement several new transformer kernels in ARM-CL to support native transformer execution. Our extended ARM-CL achieves up to three times faster transformer inference compared to state-of-the-art CPU/GPU implementations on an ARM-based embedded board. Furthermore, heterogeneous multi-processor system-on-chips (HMPSoCs) powering edge devices provide both embedded CPUs and GPUs. We introduce cooperative CPU-GPU transformer inference, which executes memory-intensive operations on the CPU while utilizing the GPU for highly parallelizable, compute-intensive operations. This cooperative execution, implemented with minimal overhead, further reduces transformer inference latency by up to 15.72% compared to the best single-processor inference on ARM-CL.","short_abstract":"Transformer models have set new performance standards for machine learning (ML) tasks. However, their resource-intensive deployment on resource-constrained edge devices for cloud-free, on-chip transformer inference remains challenging. The ARM Compute Library (ARM-CL) framework provides low-latency CNN inference on ARM...","url_abs":"https://arxiv.org/abs/2606.02836","url_pdf":"https://arxiv.org/pdf/2606.02836v1","authors":"[\"Hang Xu\",\"Yixian Shen\",\"Thanassis Giannetsos\",\"Anuj Pathania\"]","published":"2026-06-01T20:00:04Z","proceeding":"cs.AR","tasks":"[\"cs.AR\"]","methods":"[\"Transformer\",\"Convolutional Neural Network\"]","has_code":false}
