{"ID":2840731,"CreatedAt":"2026-06-01T04:54:23.091178241Z","UpdatedAt":"2026-06-01T04:54:23.091178241Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2511.13808","arxiv_id":"2511.13808","title":"Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora","abstract":"A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to $35\\times$ faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30\\% improvement in AUC at detecting new malware. We show theoretically and empirically that our approach will select the top-k items with little error and the interplay between theory and engineering required to achieve these results.","short_abstract":"A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our pro...","url_abs":"https://arxiv.org/abs/2511.13808","url_pdf":"https://arxiv.org/pdf/2511.13808v1","authors":"[\"Edward Raff\",\"Ryan R. Curtin\",\"Derek Everett\",\"Robert J. Joyce\",\"James Holt\"]","published":"2025-11-17T17:46:23Z","proceeding":"cs.CR","tasks":"[\"cs.CR\",\"cs.LG\",\"cs.MS\"]","methods":"[]","has_code":false}
