A Large-Scale Empirical Analysis of Email Spam Detection Through Network Characteristics in a Stand-Alone Enterprise

Title: A Large-Scale Empirical Analysis of Email Spam Detection Through Network Characteristics in a Stand-Alone Enterprise
Publication Type: Journal Article
Year of Publication: 2014
Authors: Ouyang, T., Ray, S., Allman, M., & Rabinovich, M.
Published in: Computer Networks
Volume: 59
Page(s): 101-121
Other Numbers: 3673
Abstract

Spam is a never-ending issue that constantly consumes resources to no useful end. In this paper, we envision spam filtering as a pipeline consisting of DNS blacklists, filters based on SYN packet features, filters based on traffic characteristics and filters based on message content. Each stage of the pipeline examines more information in the message but is more computationally expensive. A message is rejected as spam once any layer is sufficiently confident. We analyze this pipeline, focusing on the first three layers, from a single-enterprise perspective. To do this we use a large email dataset collected over two years. We devise a novel ground truth determination system to allow us to label this large dataset accurately. Using two machine learning algorithms, we study (i) how the different pipeline layers interact with each other and the value added by each layer, (ii) the utility of individual features in each layer, (iii) stability of the layers across time and network events and (iv) an operational use case investigating whether this architecture can be practically useful. We find that (i) the pipeline architecture is generally useful in terms of accuracy as well as in an operational setting, (ii) it generally ages gracefully across long time periods and (iii) in some cases, later layers can compensate for poor performance in the earlier layers. Among the caveats we find are that (i) the utility of network features is not as high in the single enterprise viewpoint as reported in other prior work, (ii) major network events can sharply affect the detection rate, and (iii) the operational (computational) benefit of the pipeline may depend on the efficiency of the final content filter.
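The layered design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names follow the abstract, but the `Stage` class, the relative costs, the thresholds, and the message feature names (`on_blacklist`, `syn_score`, `traffic_score`, `content_score`) are all hypothetical placeholders chosen for illustration. The sketch shows only the early-exit idea: stages run from cheapest to most expensive, and a message is rejected as soon as one stage is sufficiently confident.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A message is modeled as a dict of hypothetical, precomputed features.
Message = Dict[str, float]


@dataclass
class Stage:
    name: str
    cost: float                         # assumed relative cost of running the stage
    score: Callable[[Message], float]   # spam confidence in [0, 1]
    threshold: float                    # reject when score >= threshold


def classify(msg: Message, stages: List[Stage]) -> Tuple[str, str, float]:
    """Run stages in order and stop at the first one confident enough to reject.

    Returns (label, name of the deciding stage, total cost spent).
    """
    spent = 0.0
    for stage in stages:
        spent += stage.cost
        if stage.score(msg) >= stage.threshold:
            return ("spam", stage.name, spent)
    # No stage was confident enough, so the message is accepted.
    return ("ham", "none", spent)


# Toy scoring functions; real stages would be trained classifiers.
stages = [
    Stage("dnsbl", 1.0, lambda m: 1.0 if m.get("on_blacklist", 0.0) else 0.0, 0.5),
    Stage("syn", 2.0, lambda m: m.get("syn_score", 0.0), 0.9),
    Stage("traffic", 5.0, lambda m: m.get("traffic_score", 0.0), 0.9),
    Stage("content", 50.0, lambda m: m.get("content_score", 0.0), 0.5),
]

if __name__ == "__main__":
    print(classify({"on_blacklist": 1.0}, stages))    # rejected by the first stage
    print(classify({"traffic_score": 0.95}, stages))  # rejected by the third stage
    print(classify({}, stages))                       # passes every stage
```

Under these assumed costs, a message rejected at an early stage never pays for the content filter, which is the operational saving the abstract refers to. How large that saving is depends on how expensive the final stage actually is.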

Acknowledgment

This work was partially supported by funding provided to ICSI through National Science Foundation grant CNS-0433702 ("Center for Internet Epidemiology and Defenses"). Additional support was provided through National Science Foundation grants CNS-0831821 ("Relationship-Oriented Networking") and CNS-0916407 ("Understanding the Roots of the Spam Problem -- Email Address Trafficking"). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the National Science Foundation.

URL: https://www.icsi.berkeley.edu/pubs/networking/largescale14.pdf
Bibliographic Notes

Computer Networks, Vol. 59, pp. 101-121

Abbreviated Authors

T. Ouyang, S. Ray, M. Allman, and M. Rabinovich

ICSI Research Group

Networking and Security

ICSI Publication Type

Article in journal or magazine