{"ID":2863172,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2510.02391","arxiv_id":"2510.02391","title":"LLM-Generated Samples for Android Malware Detection","abstract":"Android malware continues to evolve through obfuscation and polymorphism, posing challenges for both signature-based defenses and machine learning models trained on limited and imbalanced datasets. Synthetic data has been proposed as a remedy for scarcity, yet the role of large language models (LLMs) in generating effective malware data for detection tasks remains underexplored. In this study, we fine-tune GPT-4.1-mini to produce structured records for three malware families: BankBot, Locker/SLocker, and Airpush/StopSMS, using the KronoDroid dataset. After addressing generation inconsistencies with prompt engineering and post-processing, we evaluate multiple classifiers under three settings: training with real data only, real-plus-synthetic data, and synthetic data alone. Results show that real-only training achieves near perfect detection, while augmentation with synthetic data preserves high performance with only minor degradations. In contrast, synthetic-only training produces mixed outcomes, with effectiveness varying across malware families and fine-tuning strategies. These findings suggest that LLM-generated malware can enhance scarce datasets without compromising detection accuracy, but remains insufficient as a standalone training source.","short_abstract":"Android malware continues to evolve through obfuscation and polymorphism, posing challenges for both signature-based defenses and machine learning models trained on limited and imbalanced datasets. Synthetic data has been proposed as a remedy for scarcity, yet the role of large language models (LLMs) in generating effe...","url_abs":"https://arxiv.org/abs/2510.02391","url_pdf":"https://arxiv.org/pdf/2510.02391v1","authors":"[\"Nik Rollinson\",\"Nikolaos Polatidis\"]","published":"2025-09-30T23:46:57Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.LG\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}
