{"ID":2880938,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2508.12590","arxiv_id":"2508.12590","title":"Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding","abstract":"To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLM) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLM have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for an energy-efficient importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERT Score and token throughput of 0.37 tokens/sec while saving the energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous U-HLM baseline, our method improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40. This approach enables an energy-efficient and accurate deployment of LLMs in bandwidth-constrained edge environments.","short_abstract":"To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLM) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLM have primarily focused on improving accuracy and latency, while often overlooking communicat...","url_abs":"https://arxiv.org/abs/2508.12590","url_pdf":"https://arxiv.org/pdf/2508.12590v1","authors":"[\"Jihoon Park\",\"Seungeun Oh\",\"Seong-Lyun Kim\"]","published":"2025-08-18T02:56:59Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
