{"ID":2873228,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.18114","arxiv_id":"2509.18114","title":"A Study of Skews, Imbalances, and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU","abstract":"Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency, particularly during the decode phase where load imbalance across GPU shards can cause throughput degradation and latency spikes. A DPU-assisted framework leveraged by BlueField-3 Data Processing Units can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference. By offloading monitoring tasks to the DPU and analyzing GPU telemetry and inter-node communication patterns, the resulting system can provide actionable feedback to inference controllers and schedulers. The goal of this study is three-fold i) identify the reported skews/imbalances/pathological conditions that arise in muti-GPU execution of a) LLM tensor computing (both during training and inference), b) identify their impact on computational performance, and c) make a critical assessment if those can be tracked for potential mitigation from a DPU network.","short_abstract":"Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency, particularly during the decode phase where load imbalance across GPU shards can cause throughput degradation and latency spikes. A DPU-assisted framework leveraged by BlueField-3 Data Proce...","url_abs":"https://arxiv.org/abs/2509.18114","url_pdf":"https://arxiv.org/pdf/2509.18114v1","authors":"[\"Javed I. Khan an Henry Uwabor Moye\"]","published":"2025-09-09T23:43:05Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}