## Emmes Technologies' experience using Bayesian filters to fight spam

### Emmesmail's Implementation of a Bayesian Filter

The Bayesian filter used in Emmesmail for fighting spam was derived from the one described by Paul Graham. Here we describe how our filter differs from his, and we report our results in the hope that they will be useful to others designing Bayesian filters.

In our implementation of the Bayesian filter described by Paul Graham, we found that the following parameters needed to be defined.

Token separators: \040, \011, \012 (the octal codes for space, tab, and newline).

WMIN: The minimum token length. Set to 2 to avoid examining single letters.

WMAX: The maximum token length. This eliminates long, undecipherable tokens such as those that occur in PDF documents.

PMIN, PMAX: The minimum and maximum token probabilities. These are set to values other than 0 and 1 in order to avoid division by zero in the calculations.

MINO: The minimum number of occurrences. A word must occur at least five times in our corpora to be significant in determining whether an email is spam.

MAXN: The maximum corpus size. When one of our corpora reaches 200 emails, we reduce it to the 100 most recent and then add new ones until the total is again 200.
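As a rough illustration of the tokenizing and corpus-trimming rules above, here is a minimal Python sketch. It is not the Emmesmail code: the function names are hypothetical, the WMAX value of 40 is an assumption (the value actually chosen is not given here), and the corpus is assumed to be a list ordered from oldest to newest.

```python
SEPARATORS = " \t\n"  # \040, \011, \012
WMIN = 2              # from the text: skip single letters
WMAX = 40             # ASSUMPTION: the real value is not stated in the text
MAXN = 200            # corpus size that triggers trimming
KEEP = 100            # number of most recent emails kept after trimming


def tokenize(text, separators=SEPARATORS, wmin=WMIN, wmax=WMAX):
    """Split text on the separator characters and drop tokens
    shorter than wmin or longer than wmax."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if wmin <= len(t) <= wmax]


def add_to_corpus(corpus, email):
    """Append an email; once the corpus reaches MAXN entries,
    keep only the KEEP most recent ones."""
    corpus.append(email)
    if len(corpus) >= MAXN:
        del corpus[:-KEEP]
    return corpus
```

With these settings, `tokenize("a big\tspam\nemail")` drops the single letter and returns the remaining three words.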

NAT: This is where our filter works differently from that described by Paul Graham. In the original filter, only the 15 most significant tokens were used in the calculations, where the most significant tokens are those whose probability is closest to either 0 or 1. Our procedure is to consider tokens whose probability lies within k*0.1 of 0 or 1, where k is initially 1. If we do not find at least NAT = 10 tokens, we repeat the calculation with k = 2. Until May 1, 2003, we allowed k to be incremented further. After May 1, the likelihood of an email being spam was calculated once k reached 2, regardless of the number of tokens above threshold. Like the original Paul Graham filter, we calculate the likelihood of an email being spam according to the formula

Likelihood = pspam/(pspam + pnspam)

where pspam = w1*w2*...*wn and pnspam = (1-w1)*(1-w2)*...*(1-wn), and we reject emails whose likelihood of being spam is greater than 0.9. Our change from the original protocol was made for programming convenience. We have no idea whether it makes the filter better or worse.
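The selection rule and the likelihood formula can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the Emmesmail source: the function names are hypothetical, it models only the behavior after May 1, 2003 (k stops at 2), and it assumes every weight has already been limited to the range PMIN to PMAX so that the denominator cannot be zero.

```python
import math

NAT = 10         # minimum number of tokens wanted before stopping
MAX_K = 2        # after May 1, 2003, k is not incremented past 2
THRESHOLD = 0.9  # emails above this likelihood are rejected


def select_tokens(weights, nat=NAT, max_k=MAX_K):
    """Return the weights lying within k*0.1 of 0 or 1, widening the
    band from k = 1 up to max_k until at least nat weights qualify."""
    selected = []
    for k in range(1, max_k + 1):
        band = k * 0.1
        selected = [w for w in weights if min(w, 1.0 - w) <= band]
        if len(selected) >= nat:
            break
    return selected


def spam_likelihood(weights):
    """Likelihood = pspam / (pspam + pnspam)."""
    pspam = math.prod(weights)
    pnspam = math.prod(1.0 - w for w in weights)
    return pspam / (pspam + pnspam)


def is_spam(weights, threshold=THRESHOLD):
    """Reject the email if its likelihood exceeds the threshold."""
    return spam_likelihood(select_tokens(weights)) > threshold
```

For example, two tokens that each have weight 0.9 give a likelihood of 0.81/(0.81 + 0.01), or about 0.988, which is above the 0.9 threshold.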

AFPB: AFPB stands for anti-false-positive bias. Strictly, the weights, wn, should be calculated according to the formula

wn = a/( a + b )

where a and b are the frequencies of the word in the spam and non-spam corpora, respectively. The description of the original filter recommended counting the words in the non-spam corpus twice in order to reduce the incidence of false positives. In our implementation this amounts to using the formula

wn = a/( a + b*AFPB )

where AFPB = 2.0. This turned out to be very important. Having AFPB = 1.0 resulted in about 20% of valid emails being labeled as spam, whereas setting it to 2.0 resulted in almost zero false positives.
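A minimal Python sketch of the weight calculation, combining the AFPB formula with the MINO and PMIN/PMAX parameters, might look like this. The function name is hypothetical, and two details are assumptions rather than facts from our description: MINO is applied to the combined count a + b, and the weight is limited to the PMIN/PMAX range by clamping.

```python
PMIN, PMAX = 0.01, 0.99  # initial limits on a token's weight
MINO = 5                 # minimum occurrences for a token to count
AFPB = 2.0               # anti-false-positive bias


def token_weight(a, b, afpb=AFPB, mino=MINO, pmin=PMIN, pmax=PMAX):
    """Return wn = a / (a + b*afpb), limited to [pmin, pmax].

    a, b: occurrences of the token in the spam and non-spam corpora.
    Returns None when the token is too rare to be significant
    (ASSUMPTION: MINO is compared against the combined count a + b).
    """
    if a + b < mino:
        return None
    w = a / (a + b * afpb)
    return min(pmax, max(pmin, w))
```

A token seen 10 times in each corpus gets a weight of 0.5 with AFPB = 1.0 but only about 0.33 with AFPB = 2.0, which shows how the bias pulls evenly distributed tokens toward the non-spam side.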

### Emmesmail's Bayesian Filter Performance, Mar 1, 2003 - Apr 30, 2003

| Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|
| 39 | 35 | 90 | 157 | 1 | 0.6 |
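For readers who want to check the arithmetic, the two percentages for this period follow from the raw counts as sketched below. The function name is hypothetical; the definitions (rejected spam divided by received spam, and rejected valid email divided by received valid email) are inferred from the reported numbers.

```python
def rates(spam_received, spam_rejected, valid_received, valid_rejected):
    """Return (rejection rate, false-positive rate), both in percent."""
    rejection_rate = 100.0 * spam_rejected / spam_received
    false_positive_rate = 100.0 * valid_rejected / valid_received
    return rejection_rate, false_positive_rate


rej, fp = rates(39, 35, 157, 1)
print(f"{rej:.0f}% rejected, {fp:.1f}% false positives")
# prints "90% rejected, 0.6% false positives"
```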

Our rejection rate for spam between March 1 and April 30 was not as good as that reported by Paul Graham. We therefore made the following changes to our parameters. We added '?' to our list of token separators to trap the handful of spams we received that had the token "ISO-8859-3" in the subject line followed by a base64-encoded message, and we added '@' to make tokens of domains and catch the large number of spams from yahoo.com. We reduced PMIN from 0.01 to 0.0001 and increased PMAX from 0.99 to 0.9999. We reduced the anti-false-positive bias factor (AFPB) from 2.0 to 1.2 and lowered the likelihood above which an email is considered spam from 0.9 to 0.4. Our filter then gave the following results:

### Emmesmail's Bayesian Filter Performance, May 1, 2003 - June 10, 2003

| Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|
| 29 | 26 | 90 | 73 | 1 | 1.4 |
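The parameter set used for this May 1 to June 10 period, and the effect of the two added separators, can be illustrated with a short Python sketch. The constant and function names are hypothetical, and WMAX is left out because its value is not given here.

```python
SEPARATORS = " \t\n?@"       # '?' and '@' added to space, tab, newline
PMIN, PMAX = 0.0001, 0.9999  # widened from 0.01 and 0.99
AFPB = 1.2                   # reduced from 2.0
THRESHOLD = 0.4              # lowered from 0.9
WMIN = 2                     # unchanged


def tokenize(text, separators=SEPARATORS, wmin=WMIN):
    """Split text on the separator characters and drop short tokens."""
    table = str.maketrans({ch: " " for ch in separators})
    return [t for t in text.translate(table).split(" ") if len(t) >= wmin]
```

With '@' as a separator, `tokenize("From: someone@yahoo.com")` yields `yahoo.com` as a token of its own, and with '?' as a separator, a subject such as `=?ISO-8859-3?B?abc` yields `ISO-8859-3` as a token.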

After our technique, which we considered simpler to program than Paul Graham's, failed for the second time to achieve results as good as his, we finally decided to replicate the Graham technique as closely as possible: we ranked the tokens and considered just the 15 most significant, raised the threshold for spam back to 0.9, and increased the AFPB to 1.6. After still failing to achieve Paul Graham's result, we discovered several bugs in our code, which likely affected our earlier results as well. Because we had received numerous spam emails with only base64-encoded bodies, we also decided to treat the header of the email differently from the body, defining HAFPB, the AFPB for tokens found in the header, and BAFPB, the AFPB for tokens found in the body. Our results were then:

### Emmesmail's Bayesian Filter Performance - Version 6

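| Month | Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) | Change |
|---|---|---|---|---|---|---|---|
| Jul | 22 | 22 | 100 | 54 | 5 | 9.3 | HAFPB=1.0 |
| Aug | 20 | 20 | 100 | 66 | 3 | 4.5 | BAFPB=2.0 |
| Sep | 24 | 23 | 95 | 84 | 8 | 9.5 | HAFPB=1.2 |
| Oct | 16 | 13 | 81 | 47 | 2 | 4 | BAFPB=2.2 |
| Nov | 52 | 47 | 90 | 111 | 4 | 3.6 | - |
| Dec | 74 | 70 | 95 | 90 | 4 | 4.4 | HAFPB=1.1 |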
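The Version 6 approach, ranking tokens Graham-style and applying separate biases to header and body tokens, can be sketched in Python as follows. This is an illustration only, not the Emmesmail code. The function names are hypothetical, the HAFPB and BAFPB values are simply the first ones listed in the table, the PMIN/PMAX values are assumed to be unchanged from the May revision, and each token is assumed to have a nonzero combined count.

```python
import math

N_SIGNIFICANT = 15           # Graham: use the 15 most significant tokens
THRESHOLD = 0.9              # likelihood above which an email is spam
HAFPB = 1.0                  # bias for header tokens (illustrative)
BAFPB = 2.0                  # bias for body tokens (illustrative)
PMIN, PMAX = 0.0001, 0.9999  # ASSUMPTION: carried over from May 2003


def weight(a, b, afpb):
    """wn = a / (a + b*afpb), limited to [PMIN, PMAX]; assumes a + b > 0."""
    return min(PMAX, max(PMIN, a / (a + b * afpb)))


def most_significant(weights, n=N_SIGNIFICANT):
    """Return the n weights farthest from the neutral value 0.5."""
    return sorted(weights, key=lambda w: abs(w - 0.5), reverse=True)[:n]


def is_spam(header_counts, body_counts):
    """Each argument is a list of (spam count, non-spam count) pairs,
    one pair per token found in that part of the email."""
    weights = [weight(a, b, HAFPB) for a, b in header_counts]
    weights += [weight(a, b, BAFPB) for a, b in body_counts]
    top = most_significant(weights)
    pspam = math.prod(top)
    pnspam = math.prod(1.0 - w for w in top)
    return pspam / (pspam + pnspam) > THRESHOLD
```

Keeping the header bias lower than the body bias lets header tokens count more strongly toward spam, which matters when the body is base64-encoded and contributes few usable tokens.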

### Summary Statistics

| Year | Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|---|
| 2003 | 276 | 256 | 92.8 | 682 | 28 | 4.1 |
| 2004 | 1173 | 1099 | 93.7 | 834 | 15 | 1.8 |
| 2005 | 1204 | 1133 | 94.1 | 330 | 1 | 0.3 |

Since mid-2004, use of a whitelist and blacklist has augmented pure Bayesian filtering.

Emmes Technologies
Updated Apr 3, 2005