Publication Details
Title: A Large-Scale Empirical Analysis of Email Spam Detection Through Transport-Level Characteristics
Author: T. Ouyang, S. Ray, M. Allman, and M. Rabinovich
Group: ICSI Technical Reports
Date: January 2010
PDF: http://www.icsi.berkeley.edu/pubs/techreports/TR-10-001.pdf
Overview:
Spam is a never-ending issue that constantly consumes resources to no useful end. In this paper we evaluate the efficacy of using a machine learning-based model of the transport-layer characteristics of email traffic to identify spam. The underlying idea is that the manner in which spam is transmitted has an impact that is statistically observable in the traffic (e.g., in the network round-trip time or jitter between packets). Therefore, by identifying a solid set of traffic features we can construct a model that can identify spam without relying on expensive content filtering. We carry out a large-scale empirical analysis of this idea with data collected over the course of one year (roughly 600K messages). With this data, we train classifiers using machine learning methods and test several hypotheses. First, we validate prior results using similar techniques. Second, we determine which transport characteristics contribute most significantly to the detection process. Third, we analyze the behavior of our detectors over weekly and monthly intervals and in the presence of major network events. Finally, we evaluate the behavior of our detectors in a practical setting where they are used in a filtering pipeline along with standard off-the-shelf content filtering methods, and demonstrate that they can lead to computational savings in practice.
Bibliographic Information:
ICSI Technical Report TR-10-001
Bibliographic Reference:
T. Ouyang, S. Ray, M. Allman, and M. Rabinovich. A Large-Scale Empirical Analysis of Email Spam Detection Through Transport-Level Characteristics. ICSI Technical Report TR-10-001, January 2010
