Spam is something that almost all email users have experienced, and it just keeps getting worse. Brightmail, an anti-spam software provider which supplies spam filtering for Hotmail amongst others, recently reported that in January 2004 sixty percent of all the email it processed was spam, up from forty percent in January 2003. Spam is clearly a fast-growing problem.

Add to that the viruses which propagate by emailing themselves to every address they can find, and the bounce messages generated when anti-virus software detects those emails and reports them to whatever random address the virus forged as the sender, and the volume of unwanted email becomes almost unmanageable.

Email filtering is one possible solution to the problem of finding the emails you want amongst the sea of unwanted spam and viruses. Filtering is usually an after-the-fact measure, which means the resources consumed in delivering the unwanted email are spent regardless. Other methods, such as spammer IP blacklisting, can also save the resources that would otherwise be used in the transport and delivery of spam.

In this article I am going to examine a number of Bayesian email filters and see how well they perform at distinguishing spam and virus-related emails from the emails most of us want. SpamAssassin will also be used, as it is probably the most famous of the spam filters and is thus a useful benchmark. It should be noted that SpamAssassin is designed specifically for filtering spam, while I am also filtering out virus emails, so SpamAssassin will probably perform relatively poorly in that test. That in no way reflects on its performance at filtering spam, something it excels at.

Bayesian Filtering

Bayesian spam filtering is an extremely popular way of filtering spam. This is probably due to the fact that the Bayesian technique is reasonably simple to both understand and implement while being an effective machine learning algorithm which runs quickly.

In a nutshell, a corpus of known good email and a corpus of known bad email are tokenised and the occurrences of those tokens are tracked; this is training the filter. When an email is received it is also tokenised. If its tokens are similar to those of the good emails it is classified as good; if they are similar to those of the bad emails it is classified as bad. There is a lot of leeway in this: different tokenising systems produce different results, the choice of which and how many tokens to use in the classification produces different results, and there are also variations on the Bayesian technique itself.
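To make that concrete, here is a minimal sketch of such a filter in Python. The tokeniser, the probability clamping, and the 0.9 spam cutoff are arbitrary choices of mine rather than details taken from any of the filters tested, and real implementations add many refinements on top of this.

    import re
    from collections import Counter

    def tokenise(text):
        # crude tokeniser; real filters use far more elaborate schemes
        return re.findall(r"[a-z$'-]{3,}", text.lower())

    class BayesFilter:
        def __init__(self):
            self.good = Counter()            # token -> count in good corpus
            self.bad = Counter()             # token -> count in bad corpus
            self.ngood = self.nbad = 0       # number of emails trained on

        def train(self, text, is_spam):
            target = self.bad if is_spam else self.good
            target.update(Counter(tokenise(text)))
            if is_spam:
                self.nbad += 1
            else:
                self.ngood += 1

        def spam_probability(self, text):
            # naive combination of per-token probabilities (assumes independence)
            prob, inv = 1.0, 1.0
            for token in set(tokenise(text)):
                g, b = self.good.get(token, 0), self.bad.get(token, 0)
                if g + b == 0:
                    continue
                spamminess = (b / max(self.nbad, 1)) / (
                    b / max(self.nbad, 1) + g / max(self.ngood, 1) + 1e-9)
                spamminess = min(max(spamminess, 0.01), 0.99)  # avoid certainty
                prob *= spamminess
                inv *= 1.0 - spamminess
            return prob / (prob + inv)

        def is_spam(self, text, threshold=0.9):
            return self.spam_probability(text) > threshold

Training amounts to calling train() over both corpora; classification is a single call to is_spam() on each incoming message.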

There are many other machine learning algorithms but none of them have been used to the extent that the Bayesian technique has for filtering email. Thus we will be comparing different Bayesian filters with their different implementation decisions.

The Filters

Annoyance Filter

A C++ implementation which tries to parse all sorts of email, from character sets for languages the user knows nothing about to decoding strings in PDF attachments. Version 1.0b was used and is in the public domain.

AntiSpam Mail Filter

A C++ implementation designed to be used with the exim MTA. Version 1.2 was used which is licensed under the GPL (General Public License).

Bayespam

A perl implementation designed to be used with the qmail MTA. Version 0.9.2 was used which is licensed under the GPL.

Bayesian Mail Filter

A C implementation aiming to be faster, smaller, and more versatile than similar applications. Version 0.9.4 was used which is licensed under the GPL.

Bogofilter

A C implementation which processes both plain text and email as well as handling multi-part mime messages. Version 0.16.4 was used which is licensed under the GPL.

crm114

The Controllable Regex Mutilator, a language that is designed for doing filtering and hence can act as a mail filter. Release 20040102-1.0-SanityCheck was used which is licensed under the GPL.

dbacl

A general Bayesian text classifier, implemented in C, which has an email option and can classify into more than just two categories. Version 1.6 was used which is licensed under the GPL.

DSPAM

A C implementation designed to be used system-wide on a large scale and providing advanced tokenising features. Version 2.8.3 was used which is licensed under the GPL.

Ifile

A general Bayesian mail filter, written in C, which can classify into more than just two categories and is possibly the first publicly available Bayesian classifier for email. Version 1.3.3 was used which is licensed under the GPL.

Quick Spam Filter

A C implementation designed to be small, fast, reliable, easy to install, and simple to use in a procmail recipe. Version 0.9.25 was used which is licensed under the Artistic License.

SpamAssassin

A heuristic based spam filter which uses a Bayesian classifier as one of the heuristics, included mainly as a benchmark, implemented in perl. Version 2.63 was used which is licensed under the Artistic License.

SpamBayes

A Python implementation which emphasises testing new approaches to Bayesian spam filtering. Version 1.0a7 was used which is licensed under the PSF (Python Software Foundation) license.

SpamOracle

A Bayesian spam filter designed to work with procmail and written in Objective Caml. Version 1.4 was used which is licensed under the GPL.

SpamProbe

A C++ implementation which handles MIME and skips non-text attachments. Version 0.9g2 was used which is licensed under the QPL (Q Public License).

The SpamAssassin Public Corpus

The SpamAssassin project provides a public corpus of both spam and non-spam email which is a reasonable data set to use for comparing the filters. The corpus is more useful than most spam collections since it also contains non-spam emails, which in practice are much harder to find. I used the 2003-02-28 collections, with the easy_ham, hard_ham and easy_ham_2 sets for the non-spam email, and the spam and spam_2 sets for the spam email, giving 4150 non-spam emails and 1897 spam emails.

The non-spam emails were randomly divided into ten equal-sized sets of 415 emails, and the spam emails were randomly divided into ten sets of 189 or 190 emails. Then, for each of the ten sets, the remaining nine sets were used to train the filters before that set was classified. No training was done during classification, so if a filter classified an email incorrectly it was not trained on that email; some filters may automatically train on the emails they classify, but any errors they made were not manually corrected by training.
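The splitting and merging amounts to ten-fold cross-validation. A minimal sketch, assuming hypothetical train() and classify() wrappers around whichever filter is being tested (with train() starting from an empty database on each call):

    import random

    def ten_sets(emails, folds=10):
        # shuffle and deal the emails into ten roughly equal-sized sets
        shuffled = emails[:]
        random.shuffle(shuffled)
        return [shuffled[i::folds] for i in range(folds)]

    def cross_validate(ham, spam, train, classify):
        ham_sets, spam_sets = ten_sets(ham), ten_sets(spam)
        results = []                          # (true label, predicted label)
        for i in range(10):
            train_ham = [m for j, s in enumerate(ham_sets) if j != i for m in s]
            train_spam = [m for j, s in enumerate(spam_sets) if j != i for m in s]
            train(train_ham, train_spam)      # fresh database each fold
            results += [("ham", classify(m)) for m in ham_sets[i]]
            results += [("spam", classify(m)) for m in spam_sets[i]]
        return results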

Two statistics provide the information we need to compare the filters: how much spam they miss, and how much non-spam they mark as spam. Obviously a filter that misses most of the spam isn't very useful; more importantly, a filter that marks lots of non-spam as spam is useless, since you then have to read the spam folder to find anything that was misclassified. Most people prefer a filter that misses a few spams but does not classify non-spam as spam; after all, deleting a few spams is far less of a problem than missing an important email from your boss, potential boss, or partner. These two statistics will be provided by reporting the miss rate and the false positive rate. The miss rate is the percentage of spam that is not reported as spam. The false positive rate is the percentage of non-spam that is reported as spam.

All of the filters used provide some kind of spam indicator which marks an email as spam, and they also provide a numeric score of some sort. Most of the Bayesian filters provide the probability that the email is spam, while SpamAssassin provides a hit score. The higher the score, the more likely the email is spam (crm114 is slightly different, in that the lower the score the more likely the email is spam). By ignoring the spam indicator and using the score, the filter can be made more or less strict. Requiring a higher score for an email to be marked as spam will increase the miss rate while decreasing the false positive rate. Conversely, using a lower score to mark spam will decrease the miss rate while increasing the false positive rate. Since different people have different requirements for a spam filter, the results will be analysed over all the scores and the corresponding miss rates and false positive rates will be reported.
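The curves in the figures below come from sweeping that score threshold. A rough sketch, assuming every test email has been reduced to a (score, is_spam) pair with higher scores meaning more spam-like:

    def error_rates(scored, threshold):
        # scored: list of (score, is_spam) pairs
        spam = [s for s, is_spam in scored if is_spam]
        ham = [s for s, is_spam in scored if not is_spam]
        miss_rate = 100.0 * sum(1 for s in spam if s <= threshold) / max(len(spam), 1)
        fp_rate = 100.0 * sum(1 for s in ham if s > threshold) / max(len(ham), 1)
        return miss_rate, fp_rate

    def sweep(scored):
        # one (threshold, miss rate, false positive rate) point per distinct score
        for threshold in sorted({s for s, _ in scored}):
            yield (threshold,) + error_rates(scored, threshold)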

Error Rates With Real Training Spam Chart

Figure 1

Error Rates With SpamAssassin Public Corpus Chart

Figure 2

Error Rates With SpamAssassin Public Corpus Chart Zoomed

Figure 3

The error rates of the filters are shown in Figures 1 through 3. Figure 1 instantly tells us that dbacl performed far worse than the other filters, but the rest are too close to each other to distinguish. Figures 2 and 3 show only the better filters, restricting the results to those with a miss rate and false positive rate below 10% for Figure 2, and below 4% and 3% respectively for Figure 3.

The error rate charts are easy to interpret. If the lines plotted by two filters do not cross then the filter with the lower line is better. If the lines do cross, then which filter is better will depend upon the false positive and miss rates you are prepared to put up with.

Examining Figure 3, it is clear that SpamProbe gives the best results if you desire a miss rate of less than 4% and a false positive rate of less than 1.5%. If you are willing to accept a higher false positive rate of about 1.75%, then SpamAssassin performs better, giving an extremely low miss rate; at a false positive rate of around 2.8% SpamAssassin did not miss a single spam in this test (out of interest, that was achieved by setting the spam score threshold to -1.5, which would never be done in practice). None of the other filters performed as well as those two with this corpus.

Table 1. SpamAssassin Public Corpus Default Results
Filter Precision Recall False Positives False Negatives Correct Classifications
Annoyance-filter 99.8% 89.0% 3 209 5835
Antispam 93.3% 98.2% 133 35 5879
Bayespam 94.0% 95.9% 117 77 5853
bmf 99.3% 95.9% 13 78 5956
Bogofilter 99.7% 90.3% 5 184 5858
crm114 97.4% 96.9% 50 58 5939
dbacl 32.8% 98.8% 3841 22 2184
DSPAM 99.3% 96.0% 13 76 5958
Ifile 96.6% 93.2% 63 129 5855
qsf 99.1% 91.6% 16 160 5871
SpamAssassin 99.7% 96.1% 6 74 5967
SpamBayes 99.6% 95.6% 7 83 5957
SpamOracle 100.0% 83.7% 0 309 5738
SpamProbe 99.7% 96.6% 6 65 5976

All of the filters have a default score at which they classify an email as spam; Table 1 shows the results if those default scores are used instead of the trade-offs shown in the figures. Precision and recall are used in the table and are defined as follows.

Precision is the percentage of messages classified as spam that are actually spam. It tells you what percentage of the emails that end up in the spam folder will actually be spam.

Recall is the percentage of spam messages that are actually classified as spam. The miss rate plus the recall will always equal 100%.

Precision and recall are the standard metrics used in the information retrieval literature, whereas miss rate and false positive rate are often used in reporting spam filtering results. I have used both, as both are useful, and this should allow the results here to be more easily compared with results reported elsewhere.
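As a quick worked check of the definitions, SpamProbe's row in Table 1 (6 false positives, 65 false negatives), together with the corpus sizes of 1897 spam and 4150 non-spam messages, reproduces its reported figures:

    spam_total, ham_total = 1897, 4150         # SpamAssassin public corpus sizes
    false_positives, false_negatives = 6, 65   # SpamProbe's row in Table 1

    true_positives = spam_total - false_negatives            # 1832 spams caught
    precision = 100.0 * true_positives / (true_positives + false_positives)
    recall = 100.0 * true_positives / spam_total
    miss_rate = 100.0 - recall
    false_positive_rate = 100.0 * false_positives / ham_total

    print(round(precision, 1), round(recall, 1))              # 99.7 96.6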

Based upon the default setups, SpamProbe and SpamAssassin have high precision and reasonably high recall; at the cost of slightly lower recall, more precision can be obtained with Bogofilter and Annoyance-filter. SpamOracle offers the best precision, but at the lowest recall. Judging solely by minimising total errors, SpamProbe comes out on top with the highest number of correct classifications.


Personal Mail from December 2003

My main email address is reasonably old and has been used unmunged both when posting to Usenet and in mailto: links on web pages. As you might expect, this means it receives a fair bit of spam. The month in question resulted in 146 good emails and 3557 bad emails. Note that the bad emails are not only spam; I am also including viruses and virus bounces. One of the great things about using machine learning techniques, such as Bayesian filtering, to classify mail is that there is no hardwired definition of just what spam is. The filter learns what spam is from the emails you train it with. So for this test I am also including some non-spam but still annoying emails. Logically, you would expect this to give the Bayesian filters an edge over SpamAssassin: after all, SpamAssassin probably doesn't contain rules for the non-spam emails I have included in the bad category, and the virus emails are almost identical to each other, which is ideal for Bayesian filtering.

The methodology was exactly as for the previous tests with the SpamAssassin public corpus. The emails were split into ten sets, with each set in turn used for testing while the others were used for training; the results of the ten runs were then merged to give a classification for each email in the corpus.

Error Rates With December Email Chart

Figure 4

Error Rates With December Email Chart Zoomed

Figure 5

The results are shown in Figures 4 and 5. Figure 4 shows that dbacl performed poorly, as it did in the previous test, and again most of the remaining filters are clumped so closely together in the chart that distinguishing between them is impossible. Figure 5 is restricted to results with a miss rate and a false positive rate of less than 10%, which allows us to compare the better-performing filters. Two filters stand out from the rest: Bogofilter, which catches over 98% of spam without a single false positive, and Annoyance-filter, which catches even more spam at the cost of the occasional false positive.

Table 2. December Email Default Results
Filter Precision Recall False Positives False Negatives Correct Classifications
Annoyance-filter 99.9% 99.0% 2 34 3667
Antispam 98.3% 99.9% 62 2 3639
Bayespam 99.1% 94.7% 29 189 3485
bmf 99.8% 99.3% 8 25 3670
Bogofilter 100.0% 95.2% 0 169 3534
crm114 99.2% 99.5% 27 19 3657
dbacl 99.3% 37.5% 9 2224 1470
DSPAM 100.0% 51.6% 0 1723 1980
Ifile 99.9% 97.5% 2 90 3611
qsf 100.0% 95.8% 0 149 3554
SpamAssassin 100.0% 49.4% 0 1799 1904
SpamBayes 100.0% 94.7% 1 189 3513
SpamOracle 99.9% 88.6% 2 406 3295
SpamProbe 99.7% 99.7% 10 9 3684

Table 2 shows the results of the filters used with their default settings. Qsf and Bogofilter give extremely good precision, no false positives at all, at reasonably high recalls. Annoyance-filter and bmf catch significantly more spam at the cost of a few false positives. Again, SpamProbe produces the lowest number of total errors.

Examining the False Positives

False positives are especially important when selecting a mail filter: having to read the spam folder to find the false positives almost defeats the purpose of using a filter in the first place. The number of false positives indicated by both the charts and the tables is thus an important consideration. However, it is also useful, not to mention interesting, to see just what the false positives were. This provides some insight into why the filter got the classification wrong, as well as showing whether the false positives could be reduced by using techniques such as white-listing.

From Table 2 we can see that there were 152 false positives in total, though only 71 unique false positives, as the same emails were incorrectly classified by multiple filters. Below I summarise the types of emails that each filter incorrectly classified as spam.

Annoyance-filter produced only two false positives. One was a domain renewal notification from a domain registrar and the other was a paypal subscription receipt. Missing either of those would not be a problem for me, since I keep track of my domain names and when they need to be renewed, and check my paypal account from time to time. They are still false positives, but not especially problematic ones; once the filter is trained on messages from those sources the problem will go away anyway.

Antispam produced the highest number of false positives by far; its 62 false positives are over twice as many as the next highest count. The false positives were made up of 24 newsletters, 14 work emails, 10 personal emails, eight important work emails, four mail bounce notifications, a domain renewal notice, and a paypal subscription receipt. Work emails are emails regarding my work which are not essential to me, such as meeting reminders and parking permit reminders. Important work emails are emails that would be a problem to miss, such as the sending of important files, queries from my boss, a free dinner announcement, and so on. The newsletters could be handled by using a white-list, since they have headers which make identification simple; however, a Bayesian filter should learn them by itself, as I receive them regularly and there are multiple instances to train from. Antispam is unusable with this false positive rate.

Bayespam was the second worst filter as measured by the number of false positives. The false positives were made up of 13 newsletters, seven work emails, four important work emails, two personal emails, an email bounce, a domain renewal notice, and a paypal subscription receipt. This filter isn't much better than Antispam: far too many false positives to be useful, especially when you consider that its false negative count was the fifth highest as well.

Bmf's false positives were as follows: two work emails, two newsletters, an important work email, a domain renewal notice, a personal email, and a paypal subscription receipt. This is much more reasonable than the previous two filters.

Bogofilter didn't produce any false positives.

Crm114 produced false positives consisting of: nine newsletters, seven work emails, five important work emails, three personal emails, two email bounces, and a domain renewal notice.

Dbacl performed extremely poorly, missing the majority of the spam. This is made worse by the fact that it also produced false positives: three work emails, three newsletters, an important work email, a personal email, and a paypal subscription receipt.

DSPAM didn't produce any false positives, but that is balanced by the fact that it also didn't catch much of the spam.

Ifile only generated two false positives. One was the domain renewal notice, and the other was an important work email - namely some files sent to me via email as an attachment.

Qsf didn't produce any false positives while still having a respectable but not great spam catching rate.

SpamAssassin didn't produce any false positives but also missed over half the spam. The non-standard definition of "spam" being used explains this.

SpamBayes only gave a single false positive which was a personal email consisting of an attached image with absolutely nothing else in the body of the email. This is reasonable except that the recall isn't wonderful and hence lots of spam got through.

SpamOracle classified the domain renewal notice and the paypal subscription receipt as spam just as Annoyance-filter did. However, the number of false negatives, or missed spams, was almost twelve times as high as for Annoyance-filter making this filter not very useful with 13 spams a day getting through.

SpamProbe gave ten false positives but managed to spread them around the categories: four mail bounces, an important work email, a work email, a personal email, a newsletter, a domain renewal notice, and a paypal subscription receipt.

Marking the domain renewal notice and the paypal subscription receipt as spam is almost unavoidable for a Bayesian filter (those that didn't also missed too much spam), since they have very "spam-like" content. The work emails really shouldn't be caught by the Bayesian filters, since their content involves the same topics as plenty of my legitimate mail, and the words related to those topics should be good non-spam indicators. Of course some of the work emails are also "spam-like", being announcements of seminars which are essentially advertisements and hence contain "spammy" words.

Based upon this examination of the false positives Annoyance-filter is the best filter for my email. Just over one spam a day gets past it which is bearable (and which hopefully will be reduced as those emails are used for retraining).

External Training Spam

One problem with filters that need training is that, in order to start using them, you need a collection of both spam and non-spam emails to train with. Most people have a collection of non-spam emails in their mailbox. However, most people will have deleted all the spam they have received, which means they don't have any spam with which to train. An obvious solution is to use someone else's spam; there are a number of places from which spam can be downloaded. The SpamAssassin public corpus is one; another is Spam Archive.

Using such external spam is usually considered a bad thing, since the idea of Bayesian filtering is that the filter learns what your email looks like as well as what your spam looks like - making it learn what someone else's spam looks like just isn't the same. People want to get a head start at Bayesian filtering so I'm going to look at whether such a technique is useful.

I used a different email source, which contained 116 non-spam emails and 345 spam emails. I split those into ten sets as described earlier. 400 spam emails were taken from the Spam Archive, specifically from the file 305.r2.gz. These spams were also split into ten sets. The training and testing sequence was then performed twice: the first time just as in the previous two evaluations, and the second time with the Spam Archive sets used for training while the actual spam sets were used for testing.

Error Rates With Received Spam Chart

Figure 6

Error Rates With External Spam

Figure 7

The results are shown in Figures 6 and 7. Figure 6 shows the error rates using the actual spam that the address has received as the training data, while Figure 7 shows the error rates using the Spam Archive spam as the training data. Clearly the results of training with external spam are not as good as training with the actual spam received. However, many of the filters perform better than SpamAssassin even with the external training spam.

Table 3. External versus Received Spam Training
Filter Trained on Received Spam Trained on External Spam
Precision Recall Precision Recall
Annoyance-filter 98.3% 86.1% 99.5% 54.5%
Antispam 89.2% 95.7% 90.9% 89.9%
Bayespam 92.2% 92.2% 93.6% 84.9%
bmf 96.1% 93.9% 100.0% 55.1%
Bogofilter 99.6% 78.6% 100.0% 33.9%
crm114 95.4% 99.1% 98.7% 71.0%
dbacl 100.0% 2.9% 100.0% 2.3%
DSPAM 98.9% 79.7% 99.2% 70.4%
Ifile 96.6% 83.5% 99.1% 65.8%
qsf 98.8% 70.4% 98.5% 56.5%
SpamAssassin 98.5% 58.0% 98.5% 58.0%
SpamBayes 99.4% 91.5% 99.5% 61.7%
SpamOracle 98.8% 75.1% 98.3% 67.0%
SpamProbe 98.2% 94.8% 99.5% 55.1%

Table 3 shows the precision and recall for each of the filters using both sets of training spam. These results are not as good as the previous results, probably because the corpus of email was significantly smaller and hence less suitable for training purposes. This is, however, reasonably realistic, since most people don't keep hundreds of emails at hand, and we are testing whether using external spam for training is suitable for such users.

Interestingly, the table indicates that precision is higher when training is done on the external spam than when it is done with the received spam. This may be due to the slightly larger collection of spam used when training with the external spam. The recalls are significantly lower, for most of the filters, when training with the external spam. This indicates that the external spam, while being different from the non-spam emails (and thus resulting in reasonable precision), was not similar enough to the actual received spam to give useful recall.

Training with external spam caused the filters to miss much more of the spam than training them on a slightly smaller quantity of actual spam received at the email address. This is expected; it is the reason that using external spam for training is not recommended. Different email addresses receive slightly different spam profiles, which is also why Bayesian filters can achieve higher spam detection rates than software such as SpamAssassin (ignoring its Bayesian component for the moment).

An Important Note about CRM114

The crm114 documentation advises training using a "train on error" approach, in which the filter is only trained on emails that it classifies incorrectly. I originally used that technique, but I was getting errors from crm114 on some of the emails during training, so I also tried bulk training on the data. Bulk training produced better results, so all the results above used it; however, the documentation says that this type of training results in a factor of two decrease in accuracy and speed. The crm114 results should therefore be taken with a grain of salt; halving the error rates may give a better indication of the performance crm114 can deliver. Others have reported much better accuracy from crm114.
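For reference, a train-on-error pass looks roughly like the sketch below; the classify() and train() wrappers around the filter under test are hypothetical, and crm114's actual command-line usage differs in its details.

    def train_on_error(messages, classify, train):
        # messages: list of (text, is_spam); only misclassified mail is trained on
        errors = 0
        for text, is_spam in messages:
            if classify(text) != is_spam:
                train(text, is_spam)
                errors += 1
        return errors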

These errors were worrying, as they may indicate some sort of installation or usage error on my part. I strongly recommend giving crm114 a try and comparing its results with those of other filters you are using or are considering.

Runtime Performance

There is more to a spam filter than just how well it classifies spam and non-spam. How long it takes to perform the classification and how much memory it needs can also be important, especially if mail is being filtered for a large number of people on a busy machine. The disk space used to store the classification database can also be important, especially since it is a resource that is consumed at all times and not just when actually training or classifying. Table 4 shows the runtime performance of the various filters when training on 3000 emails and then classifying 3000 emails (not the same emails that were trained on). The time recorded is the total time of the task, while the memory is the maximum memory used at any time during the task. The disk usage reported is the disk space, according to the "du" command, of the database file(s) used by the filter after both the training and the classification described above have been performed.
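For those who want to reproduce these measurements, the sketch below shows one way to collect wall-clock time, peak child memory, and database size on Linux; the command and database directory are placeholders for whichever filter is being measured, and ru_maxrss is only an approximation of the peak memory actually in use.

    import os
    import resource
    import subprocess
    import time

    def measure(command, db_dir):
        # command: argument list that trains or classifies with the filter under test
        start = time.monotonic()
        subprocess.run(command, check=True)
        elapsed = time.monotonic() - start
        # peak resident set size of child processes, in kilobytes on Linux
        peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
        # rough equivalent of running "du" over the filter's database directory
        db_kb = sum(os.path.getsize(os.path.join(root, name))
                    for root, _, names in os.walk(db_dir)
                    for name in names) // 1024
        return elapsed, peak_kb, db_kb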

Table 4. Runtime Performance
Filter Training Classifying
Time (CPU usage) Memory Time (CPU usage) Memory
Annoyance-filter 86 sec (99%) 5776 KB 293 sec (99%) 2196 KB
Antispam 146 sec (58%) 502 MB 1569 sec (99%) 48196 KB
Bayespam 67 sec (79%) 14088 KB 717 sec (99%) 8780 KB
bmf 144 sec (12%) 1604 KB 44 sec (82%) 1524 KB
Bogofilter 501 sec (8%) 2932 KB 23 sec (92%) 1364 KB
crm114 24 sec (77%) 20884 KB 632 sec (68%) 25560 KB
dbacl 425 sec (99%) 5228 KB 79 sec (99%) 4028 KB
DSPAM 1615 sec (5%) 2620 KB 3808 sec (3%) 4044 KB
Ifile 206 sec (98%) 1644 KB 54 sec (99%) 1476 KB
qsf 244 sec (96%) 1480 KB 64 sec (95%) 2076 KB
SpamAssassin 200 sec (95%) 17676 KB 3581 sec (80%) 25944 KB
SpamAssassin using spamd - - 571 sec (?%) 3188 KB
SpamBayes 144 sec (90%) 8604 KB 487 sec (98%) 6628 KB
SpamOracle 45 sec (75%) 6452 KB 22 sec (98%) 1188 KB
SpamProbe 121 sec (83%) 7608 KB 89 sec (86%) 2460 KB

Antispam performs badly, especially with respect to memory usage when training on a large corpus; I consider using over half a gigabyte of RAM completely unacceptable, especially when the next largest RAM usage is about 20 megabytes. SpamOracle and Bogofilter classify email very quickly without using much memory, making them suitable for large scale environments. DSPAM is very slow at classifying, but due to its low CPU usage percentage it should work well in a large scale environment with lots of mail being classified in parallel.

Conclusion

There is no clear "best" filter, since different filters did better with different email sets and provided different trade-offs between missing spam and generating false positives. The filters which gave the best results for a set of email at a desirable false positive rate were: Annoyance-filter, Bogofilter, SpamAssassin, and SpamProbe. Any of those filters would be a reasonable choice.

SpamAssassin has the advantage of not being a purely Bayesian filter while still having a Bayesian classification component. This allows it to work well without training while still benefiting from any training that is available. The disadvantage of SpamAssassin is that spammers can study its rules in order to craft their spams to achieve low scores; this means SpamAssassin must be kept up to date so that the rule set stays current. Bayesian filters, on the other hand, should only need upgrading for bug fixes or new features.

A good way to start with a Bayesian filter is to start running the filter on your email and train it on each email that it classifies incorrectly. The error rate will quickly drop, and soon enough you'll only be training the filter once every couple of weeks. Keeping your spam and non-spam in separate folders, instead of deleting them either on sight or when you have finished with them, allows easy training of the filter at a later date. One problem many Bayesian filters have is that the token database grows over time, which slows performance and uses up disk space; a good solution is to periodically delete the database and train the filter from scratch on your spam and non-spam. You should only train it on a reasonably sized sample of spam and non-spam; how much email you get will determine how long that sample takes to generate, and hence how often you should retrain (and how long you should keep your spam filed away).
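As an illustration of that retraining cycle, here is a sketch that rebuilds a filter database from scratch using kept mail folders stored one message per file; the folder layout, the sample size, and the train() wrapper are all assumptions, and with most of the filters above the same job is a couple of command-line invocations.

    import os

    def retrain(ham_dir, spam_dir, train, sample_size=2000):
        # take the most recent messages so the sample tracks your current mail
        def newest(directory):
            paths = [os.path.join(directory, f) for f in os.listdir(directory)]
            return sorted(paths, key=os.path.getmtime, reverse=True)[:sample_size]

        for path in newest(ham_dir):
            with open(path, errors="replace") as fh:
                train(fh.read(), is_spam=False)
        for path in newest(spam_dir):
            with open(path, errors="replace") as fh:
                train(fh.read(), is_spam=True)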