{"ID":2822738,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2601.02569","arxiv_id":"2601.02569","title":"LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference","abstract":"Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed layers are left uncompensated. We present \\textbf{LoRA-Drop}, a plug-and-play inference framework that accelerates decoding by applying a \\emph{temporal compute schedule} to a fixed subset of intermediate layers: on most decoding steps, selected layers reuse the previous-token hidden state and apply a low-rank LoRA correction, while periodic \\emph{refresh} steps execute the full model to prevent drift. LoRA-Drop requires no routing network, is compatible with standard KV caching, and can reduce KV-cache footprint by skipping KV updates in droppable layers during LoRA steps and refreshing periodically. Across \\textbf{LLaMA2-7B}, \\textbf{LLaMA3-8B}, \\textbf{Qwen2.5-7B}, and \\textbf{Qwen2.5-14B}, LoRA-Drop achieves up to \\textbf{2.6$\\times$ faster decoding} and \\textbf{45--55\\% KV-cache reduction} while staying within \\textbf{0.5 percentage points (pp)} of baseline accuracy. Evaluations on reasoning (GSM8K, MATH, BBH), code generation (HumanEval, MBPP), and long-context/multilingual benchmarks (LongBench, XNLI, XCOPA) identify a consistent \\emph{safe zone} of scheduling configurations that preserves quality while delivering substantial efficiency gains, providing a simple path toward adaptive-capacity inference in LLMs. Codes are available at https://github.com/hosseinbv/LoRA-Drop.git.","short_abstract":"Autoregressive large language models (LLMs) are bottlenecked by sequential decoding, where each new token typically requires executing all transformer layers. Existing dynamic-depth and layer-skipping methods reduce this cost, but often rely on auxiliary routing mechanisms or incur accuracy degradation when bypassed la...","url_abs":"https://arxiv.org/abs/2601.02569","url_pdf":"https://arxiv.org/pdf/2601.02569v1","authors":"[\"Hossein Rajabzadeh\",\"Maryam Dialameh\",\"Chul B. Park\",\"Il-Min Kim\",\"Hyock Ju Kwon\"]","published":"2026-01-05T21:47:47Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false,"code_links":[{"ID":605442,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2822738,"paper_url":"https://arxiv.org/abs/2601.02569","paper_title":"LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference","repo_url":"https://github.com/hosseinbv/LoRA-Drop.git","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}
