{"ID":2896432,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.07144","arxiv_id":"2507.07144","title":"M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure","abstract":"As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate failure prediction through the analysis of memory error logs (i.e., Correctable Errors) imperative. Existing memory failure prediction approaches have notable limitations: rule-based expert models suffer from limited generalizability and low recall rates, while automated feature extraction methods exhibit suboptimal performance. To address these limitations, we propose M$^2$-MFP: a Multi-scale and hierarchical memory failure prediction framework designed to enhance the reliability and availability of cloud infrastructure. M$^2$-MFP converts Correctable Errors (CEs) into multi-level binary matrix representations and introduces a Binary Spatial Feature Extractor (BSFE) to automatically extract high-order features at both DIMM-level and bit-level. Building upon the BSFE outputs, we develop a dual-path temporal modeling architecture: 1) a time-patch module that aggregates multi-level features within observation windows, and 2) a time-point module that employs interpretable rule-generation trees trained on bit-level patterns. Experiments on both benchmark datasets and real-world deployment show the superiority of M$^2$-MFP as it outperforms existing state-of-the-art methods by significant margins. Code and data are available at this repository: https://github.com/hwcloud-RAS/M2-MFP.","short_abstract":"As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate failure prediction through the analysis of memory error logs (i.e., Correctable E...","url_abs":"https://arxiv.org/abs/2507.07144","url_pdf":"https://arxiv.org/pdf/2507.07144v1","authors":"[\"Hongyi Xie\",\"Min Zhou\",\"Qiao Yu\",\"Jialiang Yu\",\"Zhenli Sheng\",\"Hong Xie\",\"Defu Lian\"]","published":"2025-07-09T05:50:13Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[]","has_code":false,"code_links":[{"ID":612275,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2896432,"paper_url":"https://arxiv.org/abs/2507.07144","paper_title":"M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure","repo_url":"https://github.com/hwcloud-RAS/M2-MFP","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
