Greg's Bogofilter Page
Introduction and general description
Bogofilter is a Bayesian spam filter, originally written by Eric S.
Raymond as a compact, fast implementation of the spam-detection
method described in Paul Graham's paper A Plan for Spam. Bogofilter
is now being developed by a team led by David Relson, with major
support from Adrian Otto, Matthias Andree, Gyepi SAM and others; the
Bogofilter project is hosted on SourceForge.
Gary Robinson took an interest in Graham's paper, and wrote an
insightful commentary in which he presented several suggestions for
possible improvements to the calculation method Graham had
developed. I had been following bogofilter since shortly after
esr released it, and it seemed worthwhile to me to evaluate Robinson's
suggestions. I therefore modified bogofilter to try them out. I
won't go into detail here, but to summarize, I found that two of Gary's
three suggestions made significant positive improvements to
bogofilter's discrimination capability. Accordingly, I finalized
a Robinson variant of esr's bogofilter 0.7 with the S and f(w)
calculations, and began using it to divert spam from my mailbox.
When bogofilter-0.7.4 appeared, I ported my mods to that and submitted
a patch to the bogofilter mailing list. David Relson was
interested, and his tests seemed to confirm my conclusion that
Robinson's calculations gave better results.
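For orientation, here's a minimal sketch of those two calculations in
R, as I understand Robinson's proposals (the function and variable
names, and the defaults shown for s and x, are mine; bogofilter's
actual implementation differs in detail):

# Robinson's f(w): a per-token spam probability, shrunk toward an
# assumed value x when there is little training data for the token.
# b, g: occurrences of the token in the spam and nonspam training sets
# nbad, ngood: numbers of spam and nonspam training messages
fw <- function(b, g, nbad, ngood, s = 1, x = 0.5) {
  n <- b + g                      # how much evidence we have for this token
  if (n == 0) return(x)           # unseen token: fall back to the prior x
  pw <- (b / nbad) / (b / nbad + g / ngood)   # raw probability estimate
  (s * x + n * pw) / (s + n)      # shrink pw toward x when evidence is scant
}

# Robinson's S: combine a message's f(w) values via geometric means.
S <- function(f) {
  n <- length(f)
  P <- 1 - prod(1 - f)^(1 / n)    # spamminess
  Q <- 1 - prod(f)^(1 / n)        # nonspamminess
  (1 + (P - Q) / (P + Q)) / 2     # indicator in [0, 1]; high means spam
}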
Since then, I've been involved with bogofilter in three ways.
I've been trying out various suggestions Gary has made; he's working
to minimize the extent to which our classification calculations
violate the underlying assumptions on which Bayesian classification
depends.
Where these have been successful, I've been passing my bogofilter
modifications along to the project, so that the improvements can be
incorporated into mainstream bogofilter. I'm also maintaining a
production version of bogofilter with a lot of options hard-coded,
which I use to filter email both for myself and for my
employer's computer users. It's my hope that this branch will
someday wither when mainstream bogofilter matures.
Improving the application of Bayes' theorem
My paper Better
Bayesian Bogofilter describes an improvement in the application of
Bayes' theorem to bogofilter's classification of messages.
Tuning bogofilter
There are several parameters that can be altered to improve
bogofilter's ability to distinguish spam from nonspam in a particular
working environment. While the defaults supplied with the
bogofilter distribution do a fair job in most circumstances, it's often
possible to reduce the number of spams that get through by half or
more, without increasing the number of nonspams misclassified as spam,
at the cost of a little experimentation. I've written a HOWTO with
instructions on getting the most out of your bogofilter
installation. I've also written a Perl script that can automate
the tuning process; the README file gives details, and the bogotune
package is available for download.
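To give a concrete idea of what such tuning involves, here is a
purely illustrative fragment of bogofilter's configuration file; the
option names come from the bogofilter distribution, but the values
are placeholders rather than recommendations:

# illustrative bogofilter.cf fragment -- values are placeholders
robx = 0.415         # score assumed for tokens not seen in training
robs = 0.0178        # Robinson's s: weight given to robx for rare tokens
min_dev = 0.1        # ignore tokens scoring within 0.1 of the neutral 0.5
spam_cutoff = 0.95   # messages scoring above this are classified as spam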
Does bogofilter scale?
Anyone planning to run a Bayesian spam filter needs to consider how
the training database will be maintained. I've put together some
thoughts about this as a basis for discussion.
Experimentation
A major purpose of this web page is to serve as a
home for writeups of the various experiments I've done to compare
calculation methods and parameters used in bogofilter, and to evaluate
various ways of using and training bogofilter:
- Comparison of Robinson's original proposal with Graham's. This
compares the Robinson geometric-mean calculation with the calculation
as performed by Paul Graham, in an experiment that also looks at some
effects of training.
- Comparison of two variants of Robinson's method. Gary Robinson
first suggested a calculation involving the geometric means of the
spam and nonspam probabilities. He later proposed the application of
Fisher's method of combining probabilities, a technique that ought to
give the optimal test of whether a population of f(w) values is or is
not uniformly distributed. Fisher's method is helpful in that it
permits identification of cases where the discrimination is weak:
instead of classifying messages as spam or nonspam, one can make a
ternary spam/nonspam/uncertain decision, which can help in training
(see later experiments). This experiment also looks at the effect of
varying the deviation threshold above which a given token is included
in the calculation. (A minimal sketch of the Fisher calculation
appears after this list.)
- Comparison of Fisher with application of a beta distribution.
Robinson also suggested the possibility that we should, for some
relatively small number r, determine the rth-closest f(w) value to 0
and the rth-closest f(w) value to 1; these will present beta
distributions with parameters r and n-r+1, where n is the number of
f(w) values from among which we chose. It seemed possible that the
selected values, plus the fact that there must be r-1 other f(w)
values more extreme than those chosen, would constitute adequate
information on which to base a decision.
- Comparison of Fisher with naïve Bayesian learning. Scott Lenser has
modified an early version of bogofilter to apply naïve Bayesian
learning. This experiment compares discrimination by his
implementation with that of my implementation of Robinson-Fisher.
- The Bayesian chain rule. The Bayesian chain rule is used in a
project called crm114, and its use is described on that site. This is
yet another comparison experiment, to see how its performance differs
from that of the Robinson-Fisher method.
- Training on error. The bogofilter package can be trained on error
only (only messages that it classifies wrongly or uncertainly are
used in training) or with every available message. This comparison
seems to indicate that with the Robinson-Fisher method either
approach works, though training on error only takes longer to reach
the minimum error rate.
- Training on error, continued. This experiment compares full
training with training on error, with many more messages and more
rounds of training than in the preceding test. It also suggests that
one can save labour, without significantly affecting the error rate,
by first building a fully trained database with about 10,000 spams
and 10,000 nonspams and then switching to training on error for
further maintenance.
- Bogofilter parameters. There is a complex relationship among the
various parameters that govern bogofilter's spam-index calculation.
This experiment explores the effect on error rate of variation in
Robinson's s parameter and in the minimum deviation from 0.5 required
for an individual token's f(w) value to be included in the
calculation (following on from the second experiment in this list).
- Subject tagging. Marking tokens derived from Subject: headers,
e.g. by prepending a flag like subj:, may assist in discrimination,
and this effect, if it exists, may depend on the minimum deviation
(difference from 0.5) required for a token to be included in the
classification calculation. This experiment tests both hypotheses.
- Bogofilter parameters (continued). The relationship of Robinson's s
parameter and the minimum deviation to bogofilter's discrimination
capability is explored further, with much larger numbers of emails
used in training and testing.
- Bogofilter parameters (continued). Yet another, still more
thorough, investigation of the relationship of Robinson's s parameter
and the minimum deviation to bogofilter's discrimination capability.
The effect of very small s values is explained (s should be kept at
or above 0.01 to avoid distorting the spam-score calculation), and
suggestions are given for tuning these parameters.
- Effect of training on bogofilter's optimum parameters. In this
experiment the effect of training on the optimum values of Robinson's
s and the minimum deviation is explored.
- Evaluation of refinements suggested by Paul Graham: preserving
case, tagging tokens from To:, From:, Subject: and Return-Path:
headers, and extracting the contents of HTML A, IMG and FONT tags.
- Bogofilter and repetitive training. Two experiments are presented
that explore the effect of training repeatedly with messages that
bogofilter classifies wrongly or as unsure.
- Bogofilter and repetitive training, continued. Training on error
alone, with or without repetition, is compared with full training and
with training on error after a period of full training.
- Default parameters for bogofilter. An attempt to determine suitable
default parameters for bogofilter.
- The Effective Size Factor. Evaluation of Gary Robinson's suggested
improvement of the chi-squared calculation.
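Since Fisher's method recurs throughout these experiments, here is a
minimal sketch of the Robinson-Fisher calculation in R, following
Robinson's published formulation (the names inv_chisq, rf_indicator,
P and Q are mine):

# Fisher: if n p-values are uniformly distributed, -2*sum(log(p))
# follows a chi-squared distribution with 2n degrees of freedom.
inv_chisq <- function(x, df) pchisq(x, df, lower.tail = FALSE)

rf_indicator <- function(f) {                   # f: a message's f(w) values
  n <- length(f)
  P <- inv_chisq(-2 * sum(log(f)), 2 * n)       # near 1 when tokens look spammy
  Q <- inv_chisq(-2 * sum(log(1 - f)), 2 * n)   # near 1 when tokens look hammy
  (1 + P - Q) / 2   # near 1: spam; near 0: nonspam; near 0.5: uncertain
}

Values near 0.5 are what make the ternary decision possible: one
cutoff above which a message is called spam, another below which it
is called nonspam, and an unsure verdict in between.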
I'll briefly describe some very early work that preceded the systematic
comparisons reported above. The first objective
was to ascertain whether the Robinson method gave better
discrimination, on the same data, than the original Graham
calculations; later, a second objective was added, namely, to see
whether Robinson's g(w) ranking scheme made a further improvement. For
each test I show the date on which it was run, the purpose, the sizes
of the training and test sets, and for each calculation method
attempted, the percentages of false negatives and false positives.
20020918
Graham, without or with a minimum deviation of 0.4, vs Robinson
msgcount_good = 5106
msgcount_bad = 1773
test = training
                 Graham_original   Graham_mindev   Robinson
false negative:  6.65%             3.0%            2.1%
false positive:  0.01%             0.01%           0

20020919
Graham vs Robinson as above
msgcount_good = 5106
msgcount_bad = 1773
test_msgcount_good = 144
test_msgcount_bad = 111
                 Graham_original   Graham_mindev   Robinson
false negative:  8.1%              6.3%            0.9%
false positive:  0.7%              0.7%            0.7%
NOTE: the one false positive detected by all 3 methods was
found to be due to a training artefact; in this test,
all three values should have been zero.

20020820
Graham vs Robinson (f(w)), as above, plus Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test = training
                 Graham_original   Graham_mindev   Robinson   Robinson_g(w)
false negative:  7.7%              6.55%           3.4%       3.6%
false positive:  0.6%              0.6%            0.19%      0.19%

20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 120
test_msgcount_bad = 86
                 Robinson   Robinson_g(w)
false negative:  2.3%       2.3%
false positive:  0          0

20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 428
test_msgcount_bad = 382
                 Robinson   Robinson_g(w)
false negative:  2.1%       1.8%
false positive:  0          0
From the above data I would conclude that (a) Robinson's f(w) and S
calculation method does give better results with my test sets than the
original method, with or without a deviation threshold; and (b)
Robinson's g(w) was, in these tests, not statistically significantly
more effective than f(w); if any real difference exists, it seems to be too
small to justify the additional computing complexity of g(w).
Users of either of Robinson's methods have access to the option -R,
which tells bogofilter to print a data frame suitable for use with
the statistics package R.
The data frame contains information about each word processed by
bogofilter, as well as the calculation results, so it's quite useful in
investigating misclassifications. Here's an example:
Token pgood pbad fw invfwlog fwlog U
1 glouis 0.986498 0.328654 0.249898 -0.28755 -1.38670 +
2 dynamicro.on.ca 0.233706 0.272581 0.538393 -0.77304 -0.61917 -
3 tue 0.217626 0.214905 0.496854 -0.68687 -0.69946 -
4 dec 0.064441 0.192363 0.749066 -1.38257 -0.28893 +
5 from 1.000000 1.000000 0.500000 -0.69315 -0.69315 -
6 greg 0.647478 0.005455 0.008354 -0.00839 -4.78497 +
7 louis 0.398061 0.006899 0.017036 -0.01718 -4.07244 +
8 gary 0.034246 0.001845 0.051122 -0.05247 -2.97355 +
9 robinson 0.016693 0.000802 0.045851 -0.04694 -3.08236 +
10 grobinson 0.006137 0.000080 0.012902 -0.01299 -4.35036 +
11 transpose.com 0.006137 0.000080 0.012902 -0.01299 -4.35036 +
12 subject 0.999386 0.999037 0.499913 -0.69297 -0.69332 -
13 testing 0.056585 0.013076 0.187703 -0.20789 -1.67289 +
14 farther 0.000982 0.000722 0.423707 -0.55114 -0.85871 -
15 reply-to 0.195164 0.522782 0.728164 -1.30255 -0.31723 +
16 mime-version 0.922671 0.838441 0.476086 -0.64643 -0.74216 -
17 content-type 0.927090 0.946414 0.505157 -0.70352 -0.68289 -
18 text 0.931018 0.947457 0.504376 -0.70194 -0.68443 -
19 plain 0.913097 0.579737 0.388347 -0.49159 -0.94586 +
20 iso-8859-1 0.261200 0.459811 0.637730 -1.01537 -0.44984 +
21 content-disposition 0.157850 0.148965 0.485522 -0.66460 -0.72253 -
22 inline 0.103719 0.121450 0.539373 -0.77517 -0.61735 -
23 organization 0.094882 0.026552 0.218656 -0.24674 -1.52026 +
24 dynamicro 0.011661 0.000562 0.045943 -0.04703 -3.08035 +
25 consulting 0.015711 0.008102 0.340231 -0.41587 -1.07813 +
26 limited 0.175402 0.078774 0.309920 -0.37095 -1.17144 +
27 x-confirm-reading-to 0.057567 0.000000 0.000000 -0.00000 -23.50485 +
28 status 0.967350 0.051500 0.050547 -0.05187 -2.98485 +
29 content-length 0.693384 0.464945 0.401393 -0.51315 -0.91282 -
30 lines 0.702590 0.468956 0.400288 -0.51131 -0.91557 -
31 again 0.152449 0.076047 0.332815 -0.40469 -1.10017 +
32 i'm 0.128391 0.037542 0.226249 -0.25650 -1.48612 +
33 trying 0.054130 0.020055 0.270332 -0.31517 -1.30810 +
34 the 0.868295 0.763035 0.467738 -0.63062 -0.75985 -
35 experiment 0.010556 0.001444 0.120328 -0.12821 -2.11754 +
36 with 0.999755 1.000000 0.500061 -0.69327 -0.69302 -
37 smaller 0.011170 0.003530 0.240119 -0.27459 -1.42662 +
38 corpus 0.002087 0.000160 0.071397 -0.07407 -2.63950 +
39 that 0.591874 0.407829 0.407950 -0.52416 -0.89661 -
40 consists 0.003191 0.002166 0.404292 -0.51800 -0.90562 -
41 spams 0.003069 0.001364 0.307674 -0.36770 -1.17872 +
42 and 0.710446 0.709450 0.499649 -0.69245 -0.69385 -
43 nonspams 0.002332 0.000000 0.000000 -0.00000 -20.29869 +
44 carefully 0.017798 0.017086 0.489802 -0.67296 -0.71375 -
45 minimize 0.003560 0.001364 0.276991 -0.32433 -1.28377 +
46 duplication 0.005033 0.000722 0.125461 -0.13406 -2.07576 +
47 oddities 0.001227 0.000000 0.000000 -0.00000 -19.65683 +
48 like 0.203633 0.272902 0.572680 -0.85022 -0.55743 -
49 uuencode 0.000614 0.000160 0.207240 -0.23223 -1.57388 +
50 will 0.425310 0.377988 0.470545 -0.63591 -0.75386 -
51 report 0.135387 0.080780 0.373691 -0.46791 -0.98433 +
52 later 0.043943 0.040270 0.478192 -0.65046 -0.73774 -
53 today 0.102123 0.195652 0.657046 -1.07016 -0.42000 +
54 gpg 0.186326 0.000401 0.002148 -0.00215 -6.14322 +
55 public 0.238247 0.142066 0.373551 -0.46769 -0.98470 +
56 key 0.124586 0.047249 0.274966 -0.32154 -1.29111 +
57 http 0.516509 0.823761 0.614623 -0.95353 -0.48675 +
58 www.bgl.nu 0.184117 0.000080 0.000436 -0.00044 -7.73901 +
59 finger 0.187799 0.002326 0.012236 -0.01231 -4.40339 +
60 bgl.nu 0.123972 0.002567 0.020286 -0.02049 -3.89782 +
61 P_Q_S_invs_logs_md 1.00e-00 0.00e+00 2.04e-09 -18.098 -150.810 0.10
Loading the above table into R and getting a histogram is as easy as
bogo <- read.table("bogo2.tbl")      # the first column becomes the row names
attach(bogo)
hist(fw[-length(fw)], breaks = 20)   # the negative index drops the last row
We have to exclude the last line from the histogram, since it's a
summary line and the value in the fw column is actually S, the
calculated index used to classify the message as spam or nonspam (a
value greater than 0.54 indicates spam).
From the histogram it's clear that this message contains
(unsurprisingly) a lot of words that don't contribute much to the
decision either way, but there are very few words with f(w) values
on the right (spam) side of the graph. There's a cluster of nonspam
values around 0.22 and another at 0.0; these suffice to identify the
message as nonspam (S=2e-09, very close to zero and far below cutoff
value 0.99).
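The same data frame also makes it easy to see which tokens drove the
decision; for instance, to list the ten tokens whose f(w) values lie
farthest from the neutral 0.5 (again excluding the summary row):

words <- bogo[-nrow(bogo), ]   # drop the summary row
dev <- abs(words$fw - 0.5)     # each token's deviation from neutrality
head(words[order(dev, decreasing = TRUE), c("Token", "fw")], 10)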
Last modified 2004-03-24 -- feedback to Greg