Book Review Department Editor Warren Keuffel

When Old Math Protects Your Mailbox

Radu State

Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification

by Jonathan A. Zdziarski, No Starch Press, 2005, ISBN 1-59327-052-6, 312 pp., US$39.95.



Spam is a major concern for most of us. Receiving unsolicited email selling blue pills or products of questionable legality is annoying, even if it takes you no more than 15 minutes to clear your inbox of spam.  

These annoyances also have a major impact on the network and on the resources needed to support users, applications, and network services. Unsolicited email wastes computing power, storage space, and network bandwidth. One solution is to filter incoming emails and automatically reject the spam. Although this solution might seem trivial, two major constraints shape any spam-filtering approach. The first is accuracy: rejecting legitimate messages or accepting spam leads to a negative end-user experience. The second is the limited data available in an email message. Long messages are comparatively easy to classify because they contain enough domain-specific words; doing the same with short messages is much more challenging.

I read Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification to learn the conceptual approaches to performing these tasks. As the subtitle suggests, the Bayesian framework is once more showing its potential and applicability.

Reminding us that elegant things are often simple, Thomas Bayes not only discovered a formula giving the posterior probability but also established a framework to express knowledge and to reason about a probabilistic experiment’s parameters. Now, three centuries later, Jonathan A. Zdziarski teaches this lesson again and shows us that the right decision matrix, sound statistical tests (such as Fisher-Robinson’s inverse chi-square and Robinson’s geometric-mean test), and good tokenization techniques are the correct ingredients for effective spam filtering. Besides the necessary mathematical background, Zdziarski goes into detailed operational and deployment architectures, covering distributed architectures, storage requirements, and associated compression and hashing methods. The last part of the book lets you rediscover Andrei Markov and how to use Markov chains for language filtering. Zdziarski describes a complementary filtering approach based on Markov chains and compares its performance with the Bayesian solution.
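To make the ingredients named above more concrete, here is a minimal Python sketch of how per-token spam probabilities can be estimated with Bayes' rule and then combined into one message score. It is not taken from the book. The function names, the equal-prior assumption, and the 0.01/0.99 clamping are illustrative choices, and the inverse chi-square combination follows one common formulation of the Fisher-Robinson approach rather than Zdziarski's exact implementation.

```python
import math


def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) with Bayes' rule, assuming equal priors."""
    spam_freq = spam_hits / n_spam if n_spam else 0.0
    ham_freq = ham_hits / n_ham if n_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat it as neutral
    p = spam_freq / (spam_freq + ham_freq)
    return min(max(p, 0.01), 0.99)  # clamp so the logs below stay finite


def combine_naive_bayes(probs):
    """Combine per-token probabilities, assuming tokens are independent."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
    # Logistic function, written so that exp() never overflows.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def chi2q(x2, dof):
    """Upper-tail probability of a chi-square variable (dof must be even)."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine_fisher(probs):
    """Inverse chi-square combination; returns a score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Evidence for spam comes from how unlikely the (1 - p) values are,
    # evidence for legitimate mail from how unlikely the p values are.
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (1.0 + spam_evidence - ham_evidence) / 2.0
```

A score near 1 suggests spam, a score near 0 suggests legitimate mail, and conflicting evidence pulls both combiners back toward 0.5. A real filter would add tokenization, training storage, and a decision threshold on top of this.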

Ending Spam is an intellectually refreshing book and fun to read, covering multiple topics while still focusing on the main objective. In roughly 300 dense, well-written pages, you can read about mathematics, networking, and language classification; discover new phrases such as “word salad” and “Bayesian poisoning”; and learn how the bad guys abuse Internet protocols and human perception capabilities to bypass the most severe filtering.

I highly recommend Ending Spam to all who want to learn the theoretical background underlying the most effective content-filtering solutions. For those implementing filtering solutions or for graduate students of language processing or information retrieval, this book is a must-read. Finally, spammers will be more than curious to learn how Bayes, an 18th-century reverend, was able to defeat them.


Radu State is a senior researcher at Inria. He also teaches a graduate-level computer security class at Henri Poincaré University. Contact him at radu.state@loria.fr.

         
