Emmes Technologies' experience using Bayesian filters to fight spam



Emmesmail's Implementation of a Bayesian Filter

The Bayesian filter used in Emmesmail for fighting spam was derived from the one described by Paul Graham. Here we describe how our filter differs from his and report our results, in the hope that this information will be useful to others designing Bayesian filters.

We found that our implementation of the Bayesian filter described by Paul Graham required the following parameters to be defined.

Parameter  Definition                                                                  Value chosen
MAXW       Maximum number of tokens allowed in hash table                              20000
WMIN       Minimum length of a hash table token                                        2
WMAX       Maximum length of a hash table token                                        40
PMIN       Minimum probability of a token                                              0.01
PMAX       Maximum probability of a token                                              0.99
PUNK       Probability given a token not seen previously                               0.5
MINO       Minimum number of times a token must appear in corpora to count             5
MAXN       Maximum number of emails in each corpus                                     200
NAT        Sufficient tokens above threshold to avoid re-search with raised threshold  10
AFPB       Anti false-positive bias factor                                             2.0
CUT        Likelihood above which an email is considered spam                          0.9
-          Characters which act as token separators                                    \040, \011, \012 (space, tab, newline)


WMIN: Set to 2 to avoid examining single letters.

WMAX: This eliminates long, undecipherable tokens such as those that occur in PDF documents.

PMIN, PMAX: Not 0 or 1, in order to avoid division by zero in the calculations.

MINO: A word must occur at least five times in our corpora to be significant in determining whether an email is spam.

MAXN: When one of our corpora grows to 200 emails, we reduce it to the 100 most recent and then add new ones until the total again reaches 200.
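
As a rough illustration of how WMIN, WMAX, the separator list and MAXN might interact, here is a minimal Python sketch. It is not Emmesmail's code: the function names tokenize and trim_corpus are hypothetical, the length limits are assumed to be inclusive, and the corpus is assumed to be a list ordered from oldest to newest email.

```python
# Illustrative sketch only; names and inclusive limits are assumptions.
SEPARATORS = " \t\n"  # \040, \011, \012
WMIN = 2
WMAX = 40
MAXN = 200


def tokenize(text, separators=SEPARATORS, wmin=WMIN, wmax=WMAX):
    """Split text on the separator characters and keep tokens of allowed length."""
    tokens = []
    current = []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    # Drop single letters (WMIN) and very long runs (WMAX).
    return [t for t in tokens if wmin <= len(t) <= wmax]


def trim_corpus(emails, maxn=MAXN):
    """Once the corpus holds maxn emails, keep only the most recent half."""
    if len(emails) >= maxn:
        return emails[-(maxn // 2):]
    return emails
```

With these settings, a single-letter word and a 41-character run are both discarded, and a corpus of 200 emails is cut back to its 100 most recent entries.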

NAT: This is where our filter works differently from the one described by Paul Graham. In the original filter, only the 15 most significant tokens were used in the calculations, where the most significant tokens are those whose probability is closest to either 0 or 1. Our procedure is to consider tokens whose probability lies within k*0.1 of 0 or 1, where k is initially 1. If we do not find at least NAT = 10 such tokens, we repeat the calculation with k = 2. Until May 1, 2003 we allowed k to be incremented further. After May 1, the likelihood of an email being spam was calculated once k reached 2, regardless of the number of tokens above threshold. Like the original Paul Graham filter, we calculate the likelihood of an email being spam according to the formula

Likelihood = pspam/(pspam + pnspam)

where pspam = w1*w2*w3*...*wn and pnspam = (1-w1)*(1-w2)*...*(1-wn),
and we reject emails whose likelihood of spam is greater than 0.9. Our change from the original protocol was made for programming convenience. We have no idea whether it makes the filter better or worse.
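
The selection rule and likelihood formula above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Emmesmail source: the function names are hypothetical, and max_k = 2 models the behaviour after May 1, 2003, when the calculation went ahead at k = 2 however many tokens had been found.

```python
# Illustrative sketch only; function names are assumptions.
NAT = 10   # sufficient number of tokens above threshold
CUT = 0.9  # likelihood above which an email is considered spam


def select_tokens(probs, nat=NAT, max_k=2):
    """Keep token probabilities within k*0.1 of 0 or 1, widening k up to max_k."""
    chosen = []
    for k in range(1, max_k + 1):
        band = k * 0.1
        chosen = [p for p in probs if p <= band or p >= 1.0 - band]
        if len(chosen) >= nat:
            break
    return chosen


def likelihood(weights):
    """Likelihood = pspam / (pspam + pnspam) over the selected weights."""
    pspam = 1.0
    pnspam = 1.0
    for w in weights:
        pspam *= w
        pnspam *= 1.0 - w
    return pspam / (pspam + pnspam)


def is_spam(probs, cut=CUT):
    """Reject the email when its likelihood of being spam exceeds cut."""
    return likelihood(select_tokens(probs)) > cut
```

Because the weights are kept away from 0 and 1 by PMIN and PMAX, neither product can be zero, so the division is always defined. With no selected tokens at all, both products stay at 1 and the likelihood is 0.5.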

AFPB: AFPB stands for anti false-positive bias. Strictly, the weights wn should be calculated according to the formula

wn = a/( a + b )

where a and b are the frequencies of the word in the spam and non-spam corpora respectively. The description of the original filter recommended counting the words in the non-spam corpus twice in order to reduce the incidence of false positives. In our implementation this amounts to using the formula

wn = a/( a + b*AFPB )

where AFPB = 2.0. This turned out to be very important. Having AFPB = 1.0 resulted in about 20% of valid emails being labeled as spam, whereas setting it to 2.0 resulted in almost zero false positives.
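
A minimal Python sketch of the biased weight formula follows. Several details are assumptions made for illustration rather than facts about Emmesmail: a and b are taken to be raw occurrence counts, MINO is applied to their sum, a token below MINO is treated like an unseen token and given PUNK, and the name token_weight is hypothetical.

```python
# Illustrative sketch only; the handling of MINO and PUNK is an assumption.
PMIN = 0.01
PMAX = 0.99
PUNK = 0.5
MINO = 5
AFPB = 2.0


def token_weight(a, b, afpb=AFPB):
    """Return wn = a / (a + b*afpb), clamped to the range [PMIN, PMAX]."""
    if a + b < MINO:
        # Too rare to be significant: treat as a token not seen previously.
        return PUNK
    w = a / (a + b * afpb)
    return min(PMAX, max(PMIN, w))
```

For a word seen 10 times in each corpus, AFPB = 1.0 gives a weight of 0.5, while AFPB = 2.0 pulls it down to 10/30, or about 0.33, which is the direction needed to reduce false positives.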


Emmesmail's Bayesian Filter Performance, Mar 1, 2003 - Apr 30, 2003

Spam Emails Rec.  Spam Emails Rej.  Rej. Rate (%)  Valid Emails Rec.  Valid Emails Rej.  False Pos. (%)
39                35                90             157                1                  0.6


Our rejection rate for spam between March 1 and April 30 was not as good as that reported by Paul Graham. We therefore made the following changes to our parameters. We added '?' to our list of token separators, to trap the handful of spam emails we received that had the token "ISO-8859-3" in the subject line followed by a base64-encoded message. We also added '@', to make tokens of domains and catch the large number of spam emails from yahoo.com. We reduced PMIN from 0.01 to 0.0001 and increased PMAX from 0.99 to 0.9999. We reduced the anti-false-positive bias factor from 2.0 to 1.2 and lowered the likelihood above which an email is considered spam from 0.9 to 0.4. Our filter then gave the following results:

Emmesmail's Bayesian Filter Performance, May 1, 2003 - June 10, 2003

Spam Emails Rec.  Spam Emails Rej.  Rej. Rate (%)  Valid Emails Rec.  Valid Emails Rej.  False Pos. (%)
29                26                90             73                 1                  1.4
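
To show what adding '?' and '@' to the separator list does to the tokens, here is a small Python sketch. The function name split_tokens and the two sample header lines are hypothetical examples, not data from our mail.

```python
# Illustrative sketch only; the sample header lines are made up.
ORIGINAL_SEPARATORS = " \t\n"
EXTENDED_SEPARATORS = " \t\n?@"


def split_tokens(text, separators):
    """Split text on any of the given separator characters."""
    for ch in separators:
        text = text.replace(ch, " ")
    return [t for t in text.split(" ") if t]


sender = "From: someone@yahoo.com"
subject = "Subject: =?ISO-8859-3?B?SGVsbG8=?="

# Without '@' the whole address is a single token.
before = split_tokens(sender, ORIGINAL_SEPARATORS)
# With '@' the domain becomes a token of its own.
after = split_tokens(sender, EXTENDED_SEPARATORS)
# With '?' the character set name becomes a token of its own.
subject_tokens = split_tokens(subject, EXTENDED_SEPARATORS)
```

With the original separators the address stays whole, so every sender produces a different token. With the extended list, "yahoo.com" and "ISO-8859-3" become tokens that recur across messages and can therefore accumulate counts in the corpora.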



After our technique, which we considered simpler to program than Paul Graham's, failed for the second time to achieve results as good as his, we finally decided to replicate the Graham technique as closely as possible: we ranked the tokens and considered just the 15 most significant, raised the threshold for spam back to 0.9, and increased the AFPB to 1.6. After still failing to achieve Paul Graham's results, we discovered several bugs in our code, which likely affected our earlier results as well. Because we had received numerous spam emails with only base64-encoded bodies, we decided to treat the header of the email differently from the body, defining HAFPB, the AFPB for tokens found in the header, and BAFPB, the AFPB for body tokens. Our results were then:


Emmesmail's Bayesian Filter Performance - Version 6

Month  Spam Emails Rec.  Spam Emails Rej.  Rej. Rate (%)  Valid Emails Rec.  Valid Emails Rej.  False Pos. (%)  Change
Jul    22                22                100            54                 5                  9.3             HAFPB=1.0
Aug    20                20                100            66                 3                  4.5             BAFPB=2.0
Sep    24                23                95             84                 8                  9.5             HAFPB=1.2
Oct    16                13                81             47                 2                  4               BAFPB=2.2
Nov    52                47                90             111                4                  3.6             -
Dec    74                70                95             90                 4                  4.4             HAFPB=1.1
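
The Version 6 approach described before the table, ranking tokens by significance and applying a separate bias factor to header and body tokens, might be sketched in Python as below. This is a hedged illustration only: the function names, the representation of each token as a (spam count, non-spam count) pair, and the default PMIN and PMAX values are assumptions, not details taken from the Emmesmail code.

```python
# Illustrative sketch only; names, data layout and clamp values are assumptions.
def token_weight(a, b, afpb, pmin=0.01, pmax=0.99):
    """Weight a / (a + b*afpb), clamped; 0.5 if the token was never counted."""
    if a + b == 0:
        return 0.5
    return min(pmax, max(pmin, a / (a + b * afpb)))


def most_significant(weights, n=15):
    """Return the n weights lying furthest from the neutral value 0.5."""
    return sorted(weights, key=lambda w: abs(w - 0.5), reverse=True)[:n]


def score(header_counts, body_counts, hafpb, bafpb, n=15):
    """Combine header and body tokens, each weighted with its own bias factor."""
    weights = [token_weight(a, b, hafpb) for a, b in header_counts]
    weights += [token_weight(a, b, bafpb) for a, b in body_counts]
    pspam = 1.0
    pnspam = 1.0
    for w in most_significant(weights, n):
        pspam *= w
        pnspam *= 1.0 - w
    return pspam / (pspam + pnspam)
```

An email would then be rejected when score(...) exceeds the threshold of 0.9. Keeping HAFPB and BAFPB as separate arguments lets each be tuned on its own, as in the Change column of the table.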



Summary Statistics

Year  Spam Emails Rec.  Spam Emails Rej.  Rej. Rate (%)  Valid Emails Rec.  Valid Emails Rej.  False Pos. (%)
2003  276               256               92.8           682                28                 4.1
2004  1173              1099              93.7           834                15                 1.8
2005  1204              1133              94.1           330                1                  0.3

Since mid-2004, use of a whitelist and blacklist has augmented pure Bayesian filtering.






Emmes Technologies
Updated Apr 3, 2005
