Speed Up Statistical Spam Filter by Approximation
January 2011 (vol. 60 no. 1)
pp. 120-134
Zhenyu Zhong, McAfee Inc., Alpharetta GA
Kang Li, The University of Georgia, Athens
Statistics-based Bayesian filters have become a popular and important defense against spam. However, despite their effectiveness, their higher processing overhead can prevent them from scaling well for enterprise-level mail servers. For example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate and are therefore relatively insensitive to increases in CPU speed. We conduct a comprehensive study to address this scaling issue by proposing a series of acceleration techniques that speed up Bayesian filters based on approximate classifications. The core approximation technique uses hash-based lookup and lossy encoding. Lookup approximation is based on the popular Bloom filter data structure with an extension to support value retrieval. Lossy encoding is used to further compress the data structure. While these approximation methods introduce additional errors to a strict Bayesian approach, we show how the errors can be both minimized and biased toward a false negative classification. We demonstrate a 6× speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and a similar false negative rate to the original filters.
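
To make the idea of a Bloom-filter-style lookup with value retrieval and lossy encoding more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the class name `ApproxTokenTable`, the cell count, the number of hash functions, the 16-level quantization, and the "keep the minimum on insert, return the maximum on lookup" rule are all hypothetical choices made here. That rule is one simple way to ensure a stored spam probability is never overestimated, so collision errors lean toward false negatives rather than false positives.

```python
import hashlib

EMPTY = 0xFF  # sentinel: no token has written to this cell (assumed encoding)


class ApproxTokenTable:
    """Hypothetical approximate map from token to quantized spam probability."""

    def __init__(self, num_cells=1 << 16, num_hashes=4, levels=16):
        self.m = num_cells          # number of one-byte cells
        self.k = num_hashes         # cells probed per token
        self.levels = levels        # quantization levels (lossy encoding)
        self.cells = bytearray([EMPTY]) * num_cells

    def _positions(self, token):
        # Derive k cell indexes from one MD5 digest via double hashing.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _quantize(self, prob):
        # Lossy step: map a probability in [0, 1] to one of `levels` integers.
        return round(prob * (self.levels - 1))

    def put(self, token, spam_prob):
        q = self._quantize(spam_prob)
        for pos in self._positions(token):
            # On a collision keep the lower (less spammy) level.
            self.cells[pos] = min(self.cells[pos], q)

    def get(self, token):
        vals = [self.cells[pos] for pos in self._positions(token)]
        if EMPTY in vals:
            return None  # token was definitely never inserted
        # Every cell is <= the token's own level, so max is the best estimate.
        return max(vals) / (self.levels - 1)


table = ApproxTokenTable()
table.put("viagra", 0.99)
table.put("meeting", 0.05)
print(table.get("viagra"), table.get("meeting"), table.get("unseen-token"))
```

As with an ordinary Bloom filter, a token that was never inserted can still return a value when all of its cells happen to be occupied; in this sketch the table stores no keys at all, which is where the memory savings come from.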

Index Terms:
Computer systems organization, performance attributes, information systems applications, miscellaneous, spam, Bloom filter, approximation.
Zhenyu Zhong, Kang Li, "Speed Up Statistical Spam Filter by Approximation," IEEE Transactions on Computers, vol. 60, no. 1, pp. 120-134, Jan. 2011, doi:10.1109/TC.2010.92