{"ID":2863850,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.25155","arxiv_id":"2509.25155","title":"Context-Driven Performance Modeling for Causal Inference Operators on Neural Processing Units","abstract":"The proliferation of large language models has driven demand for long-context inference on resource-constrained edge platforms. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to architectural mismatch: the quadratic complexity of standard attention conflicts with NPU memory and compute patterns. This paper presents a comprehensive performance analysis of causal inference operators on a modern NPU, benchmarking quadratic attention against sub-quadratic alternatives including structured state-space models and causal convolutions. Our analysis reveals a spectrum of critical bottlenecks: quadratic attention becomes severely memory-bound with catastrophic cache inefficiency, while sub-quadratic variants span from compute-bound on programmable vector cores to memory-bound by data movement. These findings provide essential insights for co-designing hardware-aware models and optimization strategies to enable efficient long-context inference on edge platforms.","short_abstract":"The proliferation of large language models has driven demand for long-context inference on resource-constrained edge platforms. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to architectural mismatch: the quadratic complexity of standard attention conflicts with N...","url_abs":"https://arxiv.org/abs/2509.25155","url_pdf":"https://arxiv.org/pdf/2509.25155v2","authors":"[\"Neelesh Gupta\",\"Rakshith Jayanth\",\"Dhruv Parikh\",\"Viktor Prasanna\"]","published":"2025-09-29T17:55:43Z","proceeding":"cs.DC","tasks":"[\"cs.DC\",\"cs.LG\"]","methods":"[\"Language Model\"]","has_code":false}
