{"ID":3084729,"CreatedAt":"2026-06-05T06:46:15.197025399Z","UpdatedAt":"2026-06-06T23:37:10.056013449Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.05516","arxiv_id":"2606.05516","title":"Dominant-Layer ZO: A Single Layer Dominates Zeroth-Order Fine-Tuning of LLMs","abstract":"Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Across multiple LLM families and downstream tasks, fine-tuning this dominant layer alone consistently matches or even exceeds full-model ZO fine-tuning. We further show that the dominant layer is task-agnostic but model-specific, and can be identified before training through a simple inference-only analysis of activation outliers. Specifically, the dominant layer consistently aligns with the first activation-outlier layer in the pre-trained model. To explain this phenomenon, we analyze how perturbation effects propagate under ZO optimization. We find that the dominant layer combines two key properties: high perturbation sensitivity and early placement in the residual stream, allowing perturbation-induced effects to propagate and accumulate through remaining subsequent decoding layers. As a result, this layer produces disproportionately strong and stable optimization signals under forward-only updates. Extensive experiments on LLaMA2-7B and Qwen3-8B across nine benchmarks show that dominant-layer ZO fine-tuning improves average performance over full-model MeZO and LoRA-based ZO fine-tuning while achieving up to 4.52$\\times$ training speedup.","short_abstract":"Zeroth-order (ZO) optimization enables memory-efficient fine-tuning of large language models (LLMs) using only forward passes, but it remains unclear how useful adaptation is distributed across layers. In this work, we reveal a surprising phenomenon: ZO fine-tuning is sharply dominated by a single decoding layer. Acros...","url_abs":"https://arxiv.org/abs/2606.05516","url_pdf":"https://arxiv.org/pdf/2606.05516v1","authors":"[\"Wanhao Yu\",\"Ziyan Wang\",\"Zheng Wang\",\"Abeer Matar Almalky\",\"Yihang Zuo\",\"Shuteng Niu\",\"Sen Lin\",\"Adnan Siraj Rakin\",\"Deliang Fan\",\"Li Yang\"]","published":"2026-06-03T23:42:09Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\",\"LoRA\"]","has_code":false}