{"ID":2892178,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2507.15465","arxiv_id":"2507.15465","title":"Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts","abstract":"Computational workloads composing traditional transformer models are starkly bifurcated. Multi-Head Attention (MHA) and Grouped-Query Attention are memory-bound due to low arithmetic intensity, while FeedForward Networks are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the attention bottleneck. This paper argues that recent architectural advances in transformer models -- Multi-head Latent Attention (MLA) and Mixture of Experts (MoE) -- introduce new dominant bottlenecks, shifting the challenge away from memory-intensive attention. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude higher than that of MHA, moving it toward a compute-bound regime well-matched to modern accelerators such as GPUs. Second, distributing MoE experts across a pool of accelerators allows batching to tune their arithmetic intensity to that of dense layers, producing a more balanced computational profile. Consequently, the focus of hardware and system optimization should shift from attention acceleration to high-bandwidth interconnects and balancing expert workloads across accelerators.","short_abstract":"Computational workloads composing traditional transformer models are starkly bifurcated. Multi-Head Attention (MHA) and Grouped-Query Attention are memory-bound due to low arithmetic intensity, while FeedForward Networks are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate...","url_abs":"https://arxiv.org/abs/2507.15465","url_pdf":"https://arxiv.org/pdf/2507.15465v3","authors":"[\"Sungmin Yun\",\"Seonyong Park\",\"Hwayong Nam\",\"Younjoo Lee\",\"Gunjun Lee\",\"Kwanhee Kyung\",\"Sangpyo Kim\",\"Nam Sung Kim\",\"Jongmin Kim\",\"Hyungyo Kim\",\"Juhwan Cho\",\"Seungmin Baek\",\"Jung Ho Ahn\"]","published":"2025-07-21T10:18:33Z","proceeding":"cs.AR","tasks":"[\"cs.AR\",\"cs.AI\"]","methods":"[\"Mixture of Experts\",\"Transformer\",\"Large Language Model\"]","has_code":false}
