{"ID":2847622,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.27486","arxiv_id":"2510.27486","title":"FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models","abstract":"AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing moment estimates ($\\boldsymbol{v}$, $\\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first \\underline{Fed}erated \\underline{AdamW} algorithm, called \\texttt{FedAdamW}, for training and fine-tuning various large models. \\texttt{FedAdamW} aligns local updates with the global update using both a \\textbf{local correction mechanism} and decoupled weight decay to mitigate local overfitting. \\texttt{FedAdamW} efficiently aggregates the \\texttt{mean} of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that \\texttt{FedAdamW} achieves a linear speedup convergence rate of $\\mathcal{O}(\\sqrt{(L Δσ_l^2)/(S K R ε^2)}+(L Δ)/R)$ without any \\textbf{heterogeneity assumption}, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of \\texttt{FedAdamW} on language and vision Transformer models. 
Compared to several baselines, \\texttt{FedAdamW} significantly reduces communication rounds and improves test accuracy. The code is available at https://github.com/junkangLiu0/FedAdamW.","short_abstract":"AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high v...","url_abs":"https://arxiv.org/abs/2510.27486","url_pdf":"https://arxiv.org/pdf/2510.27486v3","authors":"[\"Junkang Liu\",\"Fanhua Shang\",\"Hongying Liu\",\"Yuxuan Tian\",\"Yuanyuan Liu\",\"Jin Liu\",\"Kewen Zhu\",\"Zhouchen Lin\"]","published":"2025-10-31T14:04:43Z","proceeding":"cs.LG","tasks":"[\"cs.LG\",\"cs.AI\"]","methods":"[\"Vision Transformer\",\"Transformer\"]","has_code":false,"code_links":[{"ID":607547,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_id":2847622,"paper_url":"https://arxiv.org/abs/2510.27486","paper_title":"FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models","repo_url":"https://github.com/junkangLiu0/FedAdamW","is_official":false,"mentioned_in_paper":false,"mentioned_in_github":true,"github_stars":0}]}