{"ID":2863631,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.00065","arxiv_id":"2510.00065","title":"FedLLM-Align: Feature Extraction From Heterogeneous Clients","abstract":"Federated learning (FL) enables collaborative model training without sharing raw data, making it attractive for privacy-sensitive domains, e.g., healthcare, finance, and IoT. A major obstacle, however, is the potential heterogeneity of tabular data across clients, in practical settings, where schema mismatches and incompatible feature spaces prevent straightforward aggregation. To address this challenge, this paper proposes FedLLM-Align, a federated learning framework that leverages pretrained transformer based language models for feature extraction. Towards this objective, FedLLM-Align serializes tabular records into text and derives semantically aligned embeddings from a pretrained LLM encoder, e.g, DistilBERT, facilitating lightweight local classifier heads that can be trained in a federated manner using standard aggregation schemes, e.g., FedAvg, while keeping all raw data records local. To quantify the merits and trade-offs of FedLLM-Align, we evaluate the proposed framework on binary classification tasks from two different domains: i) Coronary heart disease prediction on partitioned Framingham Heart Study data, and ii) Customer churn prediction on a financial dataset. FedLLM-Align outperforms state-of-the-art baselines by up to 25% in terms of the F1 score, under simulated schema heterogeneity, and achieves a 65% reduction in the communication overhead. These results establish FedLLM-Align as a privacy-preserving and communication-efficient approach for federated training based on clients with heterogeneous tabular datasets, commonly encountered in practice.","short_abstract":"Federated learning (FL) enables collaborative model training without sharing raw data, making it attractive for privacy-sensitive domains, e.g., healthcare, finance, and IoT. A major obstacle, however, is the potential heterogeneity of tabular data across clients, in practical settings, where schema mismatches and inco...","url_abs":"https://arxiv.org/abs/2510.00065","url_pdf":"https://arxiv.org/pdf/2510.00065v2","authors":"[\"Abdelrhman Gaber\",\"Muhammad ElMahdy\",\"Youssif Abuzied\",\"Hassan Abd-Eltawab\",\"Tamer ElBatt\"]","published":"2025-09-29T14:06:52Z","proceeding":"cs.LG","tasks":"[\"cs.LG\"]","methods":"[\"Transformer\",\"Large Language Model\",\"Language Model\"]","has_code":false}
