{"ID":2848941,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.24112","arxiv_id":"2510.24112","title":"Towards Efficient and Accurate Detection of On-Chip Fail-Slow Failures for Many-Core Accelerators","abstract":"Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across the hardware topology. This paper introduces SLOTH, a lightweight, hardware-aware framework for practical on-chip fail-slow detection in many-core accelerators. SLOTH combines workload-aware instrumentation for operator-level monitoring with minimal overhead, on-the-fly trace compression to operate within kilobytes of memory, and a novel topology-aware ranking algorithm to pinpoint a failure's root cause. We evaluate SLOTH on a wide range of representative DNN workloads. The results demonstrate that SLOTH reduces the storage overhead by an average of 115.9$\\times$, while achieving an average fail-slow detection accuracy of 86.77% and a false positive rate (FPR) of 12.11%. More importantly, SLOTH scales effectively across different many-core accelerator architectures, making it practical for large-scale deployments.","short_abstract":"Many-core accelerators are essential for high-performance deep learning, but their performance is undermined by widespread fail-slow failures. Detecting such failures on-chip is challenging, as prior methods from distributed systems are unsuitable due to strict memory limits and their inability to track failures across...","url_abs":"https://arxiv.org/abs/2510.24112","url_pdf":"https://arxiv.org/pdf/2510.24112v2","authors":"[\"Junchi Wu\",\"Xinfei Wan\",\"Zhuoran Li\",\"Yuyang Jin\",\"Guangyu Sun\",\"Yun Liang\",\"Diyu Zhou\",\"Youwei Zhuo\"]","published":"2025-10-28T06:34:51Z","proceeding":"cs.AR","tasks":"[\"cs.AR\"]","methods":"[]","has_code":false}
