{"ID":2872014,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2509.09091","arxiv_id":"2509.09091","title":"Towards Confidential and Efficient LLM Inference with Dual Privacy Protection","abstract":"CPU-based trusted execution environments (TEEs) and differential privacy (DP) have gained wide applications for private inference. Due to high inference latency in TEEs, researchers use partition-based approaches that offload linear model components to GPUs. However, dense nonlinear layers of large language models (LLMs) result in significant communication overhead between TEEs and GPUs. DP-based approaches apply random noise to protect data privacy, but this compromises LLM performance and semantic understanding. To overcome the above drawbacks, this paper proposes CMIF, a Confidential and efficient Model Inference Framework. CMIF confidentially deploys the embedding layer in the client-side TEE and subsequent layers on GPU servers. Meanwhile, it optimizes the Report-Noisy-Max mechanism to protect sensitive inputs with a slight decrease in model performance. Extensive experiments on Llama-series models demonstrate that CMIF reduces additional inference overhead in TEEs while preserving user data privacy.","short_abstract":"CPU-based trusted execution environments (TEEs) and differential privacy (DP) have gained wide applications for private inference. Due to high inference latency in TEEs, researchers use partition-based approaches that offload linear model components to GPUs. However, dense nonlinear layers of large language models (LLM...","url_abs":"https://arxiv.org/abs/2509.09091","url_pdf":"https://arxiv.org/pdf/2509.09091v1","authors":"[\"Honglan Yu\",\"Yibin Wang\",\"Feifei Dai\",\"Dong Liu\",\"Haihui Fan\",\"Xiaoyan Gu\"]","published":"2025-09-11T01:54:13Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.AI\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}