The Bayesian filter used in Emmesmail for fighting spam was derived from the one described by Paul Graham. Here we describe how our filter differs from his, and we present our results so that this information may be useful to others designing Bayesian filters.

We found that in our implementation of the Bayesian filter described by Paul Graham, the following parameters needed to be defined.

| Parameter | Definition | Value chosen |
|-----------|------------|--------------|
| MAXW | Maximum number of tokens allowed in hash table | 20000 |
| WMIN | Minimum length of a hash table token | 2 |
| WMAX | Maximum length of a hash table token | 40 |
| PMIN | Minimum probability of a token | 0.01 |
| PMAX | Maximum probability of a token | 0.99 |
| PUNK | Probability given a token not seen previously | 0.5 |
| MINO | Minimum number of times a token must appear in the corpora to count | 5 |
| MAXN | Maximum number of emails in each corpus | 200 |
| NAT | Sufficient tokens above threshold to avoid re-search with raised threshold | 10 |
| AFPB | Anti false-positive bias factor | 2.0 |
| CUT | Likelihood above which an email is considered spam | 0.9 |
| - | Characters which act as token separators | \040, \011, \012 |
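The separator and length rules above can be sketched as follows. This is a minimal Python illustration, not Emmesmail's actual code; the function name and structure are ours.

```python
# Token-length limits and separator set from the parameter table.
WMIN, WMAX = 2, 40
SEPARATORS = {" ", "\t", "\n"}  # octal \040, \011, \012

def tokenize(text):
    """Split text on the separator characters, keeping only tokens
    whose length lies within [WMIN, WMAX]."""
    tokens, current = [], []
    for ch in text:
        if ch in SEPARATORS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return [t for t in tokens if WMIN <= len(t) <= WMAX]

print(tokenize("buy  now\tthe a\nlimited offer"))
# → ['buy', 'now', 'the', 'limited', 'offer']  (single-letter 'a' is dropped)
```

Note that WMIN discards single letters and WMAX discards the long undecipherable runs that appear in base-64 or PDF content.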

**WMIN**: Set to 2 to avoid examining single letters.

**WMAX**: This eliminates long undecipherable tokens such as those that occur in PDF documents.

**PMIN, PMAX**: Not 0 or 1, in order to avoid division by zero in the
calculations.

**MINO**: A word must occur at least five times in our corpora to be significant with regard to determining whether an email is spam.

**MAXN**: When one of our corpora grows to contain 200 emails, we reduce it to the 100 most recent and then add new ones until the total is again 200.

**NAT**: This is where our filter works differently from that
described by Paul Graham. In the filter described originally, only the 15 most
significant tokens were used in the calculations, where the most significant
tokens have a probability that is close to either 0 or 1. The procedure we use
is to consider tokens whose difference from 0 or 1 is k*0.1 or less, where k
initially is 1. If we don't find at least NAT = 10 tokens, we repeat the
calculations with k = 2. Until May 1, 2003 we allowed k to be incremented
further. After May 1, the likelihood of an email being spam was calculated
after k reached 2, regardless of the number of tokens above threshold. Like the
original Paul Graham filter, we calculate the likelihood of an email being spam
according to the formula

Likelihood = pspam/(pspam + pnspam)

where pspam = w_{1}*w_{2}*...*w_{n} and pnspam = (1-w_{1})*(1-w_{2})*...*(1-w_{n}),

and we reject emails whose likelihood of being spam is greater than 0.9. Our
change from the original protocol was made for programming convenience. We have
no idea whether it makes the filter better or worse.
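The widening-threshold selection and the likelihood formula can be sketched as below. This is our own minimal Python illustration of the post May 1, 2003 behaviour; the names are ours, and the token probabilities w_i are assumed to be already computed from the two corpora.

```python
NAT = 10  # sufficient tokens above threshold to avoid re-search

def select_tokens(probs, nat=NAT):
    """Keep tokens whose probability is within k*0.1 of 0 or 1.
    If fewer than `nat` qualify with k=1, widen once to k=2 and
    stop there, regardless of how many tokens were found."""
    chosen = []
    for k in (1, 2):
        chosen = [p for p in probs if min(p, 1.0 - p) <= k * 0.1]
        if len(chosen) >= nat:
            break
    return chosen

def likelihood(probs):
    """Likelihood = pspam / (pspam + pnspam) over the chosen tokens."""
    chosen = select_tokens(probs)
    pspam = pnspam = 1.0
    for p in chosen:
        pspam *= p
        pnspam *= 1.0 - p
    return pspam / (pspam + pnspam)
```

With no significant tokens at all, both products are 1 and the likelihood degenerates to 0.5, i.e. undecided.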

**AFPB**: AFPB stands for anti false-positive bias. The weights,
w_{n}, strictly should be calculated according to the formula

w_{n} = a/( a + b )

where a and b are the frequencies of the word in the spam and non-spam
corpora respectively. The description of the original filter recommended counting
the words in the non-spam corpus twice in order to reduce the incidence of false
positives. In our implementation this amounts to using the formula

w_{n} = a/( a + b*AFPB )

where AFPB = 2.0. This turned out to be very important. Having AFPB = 1.0 resulted in about 20% of valid emails being labeled as spam, whereas setting it to 2.0 resulted in almost zero false positives.
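A hedged sketch of the biased weight formula, with the PMIN/PMAX clamp and the PUNK default for unseen tokens folded in (the function name is ours, not Emmesmail's):

```python
AFPB = 2.0          # anti false-positive bias factor
PMIN, PMAX = 0.01, 0.99
PUNK = 0.5          # probability for a token not seen before

def token_weight(a, b, afpb=AFPB):
    """w = a / (a + b*AFPB), where a and b are the token's counts in
    the spam and non-spam corpora; clamped into [PMIN, PMAX] so that
    later products never hit exactly 0 or 1."""
    if a + b == 0:
        return PUNK  # token never seen in either corpus
    w = a / (a + b * afpb)
    return min(PMAX, max(PMIN, w))

print(token_weight(10, 5))  # → 0.5, versus 10/15 ≈ 0.67 unbiased
```

Doubling b pushes every weight toward "not spam", which is why raising AFPB from 1.0 to 2.0 cut the false-positive rate so sharply.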

| Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|
| 39 | 35 | 90 | 157 | 1 | 0.6 |

Our rejection rate for spam between March 1 and April 30 was not as good as that reported by Paul Graham. We therefore made the following changes to our parameters. We added '?' to our list of token separators, to trap the handful of spams we received that had the token "ISO-8859-3" in the subject line followed by a base-64 encoded message, and we added '@' to make tokens of domains and catch the large number of spams from yahoo.com. We reduced PMIN from 0.01 to 0.0001 and increased PMAX from 0.99 to 0.9999. We reduced the anti false-positive bias factor (AFPB) from 2.0 to 1.2 and lowered the likelihood above which an email is considered spam (CUT) from 0.9 to 0.4. Our filter then gave the following results:

| Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|
| 29 | 26 | 90 | 73 | 1 | 1.4 |

After our technique, which we considered simpler to program than Paul Graham's, failed for the second time to achieve results as good as his, we decided finally to replicate the Graham technique as closely as possible: ranking the tokens and considering just the 15 most significant, raising the spam threshold back to 0.9, and increasing the AFPB to 1.6. After still failing to achieve Paul Graham's results, we discovered several bugs in our code (which likely affected our earlier results as well). Because we had received numerous spam emails with only base-64 encoded bodies, we decided to treat the header part of the email differently from the body, defining HAFPB, the AFPB for the tokens found in the header, and BAFPB, the AFPB for the body tokens. Our results were then:
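The header/body split amounts to choosing a different bias factor per token depending on where it was found. The sketch below is our own illustration of that idea (helper names and the count dictionaries are hypothetical), using the initial values HAFPB=1.0 and BAFPB=2.0:

```python
HAFPB = 1.0  # anti false-positive bias for header tokens
BAFPB = 2.0  # anti false-positive bias for body tokens

def weight(a, b, afpb):
    """a = spam-corpus count, b = non-spam-corpus count; unseen
    tokens get the neutral probability 0.5 (PUNK)."""
    return a / (a + b * afpb) if a + b else 0.5

def token_prob(token, in_header, spam_counts, ham_counts):
    """Pick the bias factor by where the token appeared."""
    afpb = HAFPB if in_header else BAFPB
    return weight(spam_counts.get(token, 0), ham_counts.get(token, 0), afpb)
```

Because BAFPB > HAFPB, a token counts more heavily toward spam when it appears in the header than when the same token appears in the body.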

| Month | Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) | Change |
|---|---|---|---|---|---|---|---|
| Jul | 22 | 22 | 100 | 54 | 5 | 9.3 | HAFPB=1.0 |
| Aug | 20 | 20 | 100 | 66 | 3 | 4.5 | BAFPB=2.0 |
| Sep | 24 | 23 | 95 | 84 | 8 | 9.5 | HAFPB=1.2 |
| Oct | 16 | 13 | 81 | 47 | 2 | 4 | BAFPB=2.2 |
| Nov | 52 | 47 | 90 | 111 | 4 | 3.6 | - |
| Dec | 74 | 70 | 95 | 90 | 4 | 4.4 | HAFPB=1.1 |

| Year | Spam Emails Rec. | Spam Emails Rej. | Rej. Rate (%) | Valid Emails Rec. | Valid Emails Rej. | False Pos. (%) |
|---|---|---|---|---|---|---|
| 2003 | 276 | 256 | 92.8 | 682 | 28 | 4.1 |
| 2004 | 1173 | 1099 | 93.7 | 834 | 15 | 1.8 |
| 2005 | 1204 | 1133 | 94.1 | 330 | 1 | 0.3 |


Updated Apr 3, 2005