Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels

Title: Better Malware Ground Truth: Techniques for Weighting Anti-Virus Vendor Labels
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Kantchelian, A., Tschantz, M. Carl, Afroz, S., Miller, B., Shankar, V., Bachwani, R., Joseph, A. D., & Tygar, J. D.
Published in: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security
Place Published: New York, NY, USA
ISBN Number: 978-1-4503-3826-4
Keywords: aggregating labels, anti-virus vendors, expectation-maximization, labeling problem

We examine the problem of aggregating the results of multiple anti-virus (AV) vendors' detectors into a single authoritative ground-truth label for every binary. To do so, we adapt a well-known generative Bayesian model that postulates the existence of a hidden ground truth upon which the AV labels depend. The technique is fully unsupervised, and we train the model with Expectation Maximization. We evaluate our method using 279,327 distinct binaries from VirusTotal, each of which appeared for the first time between January 2012 and June 2014.
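To make the idea of a hidden ground truth concrete, here is a minimal sketch of Expectation Maximization on a simple latent-class model. It is not the paper's model: the assumption that each vendor has a single true positive rate and false positive rate, the function name `em_aggregate`, the smoothing constant, and the synthetic votes are all illustrative choices made for this example.

```python
import math

def em_aggregate(votes, iterations=50, smooth=1e-2):
    """votes[i][j] is 1 if vendor j flagged binary i, else 0.

    Assumed model (illustrative): a hidden malicious/benign label per
    binary, and per-vendor true/false positive rates.
    """
    n, m = len(votes), len(votes[0])
    # Start from the fraction of vendors that flagged each binary.
    post = [sum(row) / m for row in votes]
    for _ in range(iterations):
        # M-step: re-estimate the prior and each vendor's rates.
        total_mal = sum(post)
        total_ben = n - total_mal
        prior = (total_mal + smooth) / (n + 2 * smooth)
        tpr, fpr = [], []
        for j in range(m):
            hit_mal = sum(post[i] * votes[i][j] for i in range(n))
            hit_ben = sum((1 - post[i]) * votes[i][j] for i in range(n))
            tpr.append((hit_mal + smooth) / (total_mal + 2 * smooth))
            fpr.append((hit_ben + smooth) / (total_ben + 2 * smooth))
        # E-step: posterior probability that each binary is malicious.
        new_post = []
        for row in votes:
            log_mal = math.log(prior)
            log_ben = math.log(1 - prior)
            for j, v in enumerate(row):
                log_mal += math.log(tpr[j] if v else 1 - tpr[j])
                log_ben += math.log(fpr[j] if v else 1 - fpr[j])
            diff = max(-700.0, min(700.0, log_ben - log_mal))
            new_post.append(1 / (1 + math.exp(diff)))
        post = new_post
    return post, tpr, fpr

# Synthetic votes (invented for this sketch, not VirusTotal data):
# vendors 0-2 mostly agree with the truth, vendor 3 ignores it.
truth = [1] * 30 + [0] * 70
flip_at = [0, 3, 6]  # vendor j errs when i % 10 == flip_at[j]
votes = []
for i, t in enumerate(truth):
    row = [1 - t if i % 10 == r else t for r in flip_at]
    row.append(1 if i % 2 == 0 else 0)
    votes.append(row)

post, tpr, fpr = em_aggregate(votes)
predicted = [1 if p > 0.5 else 0 for p in post]
```

The `truth` list is used only to build the synthetic votes and to check the output afterwards; `em_aggregate` never sees it, which is what makes the procedure unsupervised. The estimated `tpr` and `fpr` values act as per-vendor weights, so a vendor whose votes are unrelated to the others contributes little to the final label.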

Our evaluation shows that our statistical model is consistently more accurate at predicting the future-derived ground truth than all unweighted rules of the form "k out of n" AV detections. In addition, we evaluate the scenario where partial ground truth is available for model building. We train a logistic regression predictor on the partial label information. Our results show that as few as 100 randomly selected training instances with ground truth are enough to achieve an 80% true positive rate at a 0.1% false positive rate. In comparison, the best unweighted threshold rule provides only a 60% true positive rate at the same false positive rate.
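The contrast between an unweighted "k out of n" rule and a learned weighting can be sketched as follows. This is an illustration under stated assumptions, not the paper's experimental setup: the features are assumed to be the per-vendor binary detections, the helper names (`k_of_n`, `rates`, `fit_logistic`) are hypothetical, the logistic regression is fitted with plain gradient descent and no regularisation, and the eight-row dataset is invented so that one vendor is informative and two are noise.

```python
import math

def k_of_n(votes, k):
    """Unweighted rule: flag a binary when at least k vendors detect it."""
    return [1 if sum(row) >= k else 0 for row in votes]

def rates(predicted, truth):
    """Return (true positive rate, false positive rate)."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    positives = sum(truth)
    negatives = len(truth) - positives
    return tp / positives, fp / negatives

def fit_logistic(votes, truth, steps=2000, lr=0.5):
    """Batch gradient descent on the logistic loss (no regularisation)."""
    n, m = len(votes), len(votes[0])
    weights, bias = [0.0] * m, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * m, 0.0
        for row, y in zip(votes, truth):
            z = bias + sum(w * v for w, v in zip(weights, row))
            err = 1 / (1 + math.exp(-z)) - y
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

# Toy data (invented): vendor 0 tracks the truth, vendors 1 and 2 do not.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
votes = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0],
    [0, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0],
]

for k in (1, 2, 3):
    print("k =", k, rates(k_of_n(votes, k), truth))

weights, bias = fit_logistic(votes, truth)
weighted = [
    1 if bias + sum(w * v for w, v in zip(weights, row)) > 0 else 0
    for row in votes
]
print("weighted", rates(weighted, truth))
```

On this toy data, no choice of k separates the classes (k = 2 gives a 75% true positive rate at a 25% false positive rate), whereas the fitted weights concentrate on vendor 0. The numbers say nothing about real AV vendors; they only show why a weighted predictor can beat every unweighted threshold.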


This research is supported in part by Intel's ISTC for Secure Computing, NSF grants 0424422 (TRUST) and 1139158, the Freedom 2 Connect Foundation, US State Dept. DRL, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, and gifts from Amazon, Google, SAP, Apple, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOn Talis, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!. The opinions in this paper are those of the authors and do not necessarily reflect those of any funding sponsor or the United States Government.

ICSI Research Group: Networking and Security