Enhancing reliability in AI inference services: An empirical study on real production incidents
Abstract
Hyperscale large language model (LLM) inference places extraordinary demands on cloud systems, where even brief failures can translate into significant user and business impact. To better understand and mitigate these risks, we present one of the first provider-internal, practice-based analysis of LLM inference incidents. We developed a taxonomy and methodology grounded in a year of operational experience, validating it on 156 high-severity incidents, and conducted a focused quantitative study of Apr-Jun 2025 to ensure recency and relevance. Our approach achieves high labeling consistency (Cohen's K ~0.89), identifies dominant failure modes (in our dataset ~60% inference engine failures, within that category ~40% timeouts), and surfaces mitigation levers (~74% auto-detected; ~28% required hotfix). Beyond hotfixes, many incidents were mitigated via traffic routing, node rebalancing, or capacity increase policies, indicating further automation opportunities. We also show how the taxonomy guided targeted strategies such as connection liveness, GPU capacity-aware routing, and per-endpoint isolation and reduced incident impact and accelerated recovery. Finally, we contribute a practitioner-oriented adoption checklist that enables others to replicate our taxonomy, analysis, and automation opportunities in their own systems. This study demonstrates how systematic, empirically grounded analysis of inference operations can drive more reliable and cost-efficient LLM serving at scale.