{"ID":2857023,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.10074","arxiv_id":"2510.10074","title":"StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis","abstract":"Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.","short_abstract":"Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, inc...","url_abs":"https://arxiv.org/abs/2510.10074","url_pdf":"https://arxiv.org/pdf/2510.10074v2","authors":"[\"Jiayi Mao\",\"Liqun Li\",\"Yanjie Gao\",\"Zegang Peng\",\"Shilin He\",\"Chaoyun Zhang\",\"Si Qin\",\"Samia Khalid\",\"Qingwei Lin\",\"Saravan Rajmohan\",\"Sitaram Lanka\",\"Dongmei Zhang\"]","published":"2025-10-11T07:18:36Z","proceeding":"cs.AI","tasks":"[\"cs.AI\"]","methods":"[\"Large Language Model\"]","has_code":false,"code_links":[{"ID":608406,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2857023,"paper_url":"https://arxiv.org/abs/2510.10074","paper_title":"StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis","repo_url":"https://github.com/microsoft/StepFly","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}