{"ID":2877811,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.20274","arxiv_id":"2508.20274","title":"Predictable LLM Serving on GPU Clusters","abstract":"Latency-sensitive inference on shared A100 clusters often suffers noisy-neighbor interference on the PCIe fabric, inflating tail latency and SLO violations. We present a fabric-agnostic, VM-deployable host-level controller that combines dynamic Multi-Instance GPU (MIG) reconfiguration, PCIe-aware placement, and lightweight guardrails (MPS quotas, cgroup I/O). It samples per-tenant tails and system signals, uses topology hints to avoid PCIe hot spots, and gates actions with dwell/cool-down to avoid thrash. On a single host and a 2-node (16-GPU) cluster, SLO miss-rate is reduced by \\(\\approx\\)32\\% (\\(\\approx\\)1.5) and p99 latency improves \\(\\approx\\)15\\% with \\(\\leq\\)5\\% throughput cost versus static MIG and naive placement; ablations show MIG and placement contribute comparably. We also evaluate LLM serving with vLLM on OLMo 2 7B Instruct: TTFT p99 improves \\(\\approx\\)10--15\\% at \\(\\leq\\)5\\% cost without changing the controller.","short_abstract":"Latency-sensitive inference on shared A100 clusters often suffers noisy-neighbor interference on the PCIe fabric, inflating tail latency and SLO violations. We present a fabric-agnostic, VM-deployable host-level controller that combines dynamic Multi-Instance GPU (MIG) reconfiguration, PCIe-aware placement, and lightwe...","url_abs":"https://arxiv.org/abs/2508.20274","url_pdf":"https://arxiv.org/pdf/2508.20274v1","authors":"[\"Erfan Darzi\",\"Shreeanant Bharadwaj\",\"Sree Bhargavi Balija\"]","published":"2025-08-27T21:15:41Z","proceeding":"cs.DC","tasks":"[\"cs.DC\"]","methods":"[\"Large Language Model\"]","has_code":false}
