Greg's Bogofilter Page
Introduction and general description
Bogofilter is a Bayesian spam filter, originally written by Eric S.
Raymond as a compact, fast implementation of the spam-detection
method described in Paul Graham's paper A Plan for Spam. Bogofilter
is now being developed by a team led by David Relson, with major
support from Adrian Otto, Matthias Andree, Gyepi SAM and others; the
Bogofilter project is hosted on SourceForge.
Gary Robinson took an interest in Graham's paper, and wrote an
insightful commentary in which he presented several suggestions for
possible improvements to the calculation method Graham had
developed. I had been following bogofilter since shortly after
esr released it, and it seemed worthwhile to me to evaluate Robinson's
suggestions. I therefore modified bogofilter to try them out. I
won't go into detail here, but to summarize, I found that two of Gary's
three suggestions made significant positive improvements to
bogofilter's discrimination capability. Accordingly, I finalized
a Robinson variant of esr's bogofilter 0.7 with the S and f(w)
calculations, and began using it to divert spam from my mailbox.
When bogofilter-0.7.4 appeared, I ported my mods to that and submitted
a patch to the bogofilter mailing list. David Relson was
interested, and his tests seemed to confirm my conclusion that
Robinson's calculations gave better results.
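For orientation, here's a minimal sketch of those two calculations in
R, as I understand Robinson's proposals (the function and variable
names, and the defaults shown for s and x, are mine; bogofilter's
actual implementation differs in detail):

# Robinson's f(w): a per-token spam probability, shrunk toward an
# assumed value x when there is little training data for the token.
# b, g: occurrences of the token in the spam and nonspam training sets
# nbad, ngood: numbers of spam and nonspam training messages
fw <- function(b, g, nbad, ngood, s = 1, x = 0.5) {
  n <- b + g                      # how much evidence we have for this token
  if (n == 0) return(x)           # unseen token: fall back to the prior x
  pw <- (b / nbad) / (b / nbad + g / ngood)   # raw probability estimate
  (s * x + n * pw) / (s + n)      # shrink pw toward x when evidence is scant
}

# Robinson's S: combine a message's f(w) values via geometric means.
S <- function(f) {
  n <- length(f)
  P <- 1 - prod(1 - f)^(1 / n)    # spamminess
  Q <- 1 - prod(f)^(1 / n)        # nonspamminess
  (1 + (P - Q) / (P + Q)) / 2     # indicator in [0, 1]; high means spam
}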
Since then, I've been involved with bogofilter in three ways.
I've been trying out various suggestions Gary has made; he's working
to minimize the extent to which our classification calculations
violate the underlying assumptions on which Bayesian classification
depends.
Where these have been successful, I've been passing my bogofilter
modifications along to the project, so that the improvements can be
incorporated into mainstream bogofilter. I'm also maintaining a
production version of bogofilter with a lot of options hard-coded,
which I use to filter email both for myself and for my
employer's computer users. It's my hope that this branch will
someday wither when mainstream bogofilter matures.
Improving the application of Bayes' theorem
My paper Better
Bayesian Bogofilter describes an improvement in the application of
Bayes' theorem to bogofilter's classification of messages.
Tuning bogofilter
There are several parameters that can be altered to improve
bogofilter's ability to distinguish spam from nonspam in a particular
working environment. While the defaults supplied with the
bogofilter distribution do a fair job in most circumstances, it's often
possible to reduce the number of spams that get through by half or
more, without increasing the number of nonspams misclassified as spam,
at the cost of a little experimentation. I've written a HOWTO with
instructions on getting the most out of your bogofilter
installation. I've also written a Perl script that can automate
the tuning process; the README file gives details, and the bogotune
package is available for download.
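To give a concrete idea of what such tuning involves, here is a
purely illustrative fragment of bogofilter's configuration file; the
option names come from the bogofilter distribution, but the values
are placeholders rather than recommendations:

# illustrative bogofilter.cf fragment -- values are placeholders
robx = 0.415         # score assumed for tokens not seen in training
robs = 0.0178        # Robinson's s: weight given to robx for rare tokens
min_dev = 0.1        # ignore tokens scoring within 0.1 of the neutral 0.5
spam_cutoff = 0.95   # messages scoring above this are classified as spam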
Does bogofilter scale?
Anyone planning to run a Bayesian spam filter needs to consider how
the training database will be maintained. I've put together some
thoughts about this as a basis for discussion.
Experimentation
A major purpose of this web page is to serve as a
home for writeups of the various experiments I've done to compare
calculation methods and parameters used in bogofilter, and to evaluate
various ways of using and training bogofilter:
- Comparison of Robinson's original proposal with Graham's. This
compares the Robinson geometric-mean calculation with the calculation
as performed by Paul Graham, in an experiment that also looks at some
effects of training.
- Comparison of two variants of Robinson's method. Gary Robinson
first suggested a calculation involving the geometric means of the
spam and nonspam probabilities. He later proposed the application of
Fisher's method of combining probabilities, a technique that ought to
give the optimal test of whether a population of f(w) values is or is
not uniformly distributed. Fisher's method is helpful in that it
permits identification of cases where the discrimination is weak:
instead of classifying messages as spam or nonspam, one can make a
ternary spam/nonspam/uncertain decision, which can help in training
(see later experiments). This experiment also looks at the effect of
varying the deviation threshold above which a given token is included
in the calculation. (A minimal sketch of the Fisher calculation
appears after this list.)
- Comparison of Fisher with application of a beta distribution.
Robinson also suggested the possibility that we should, for some
relatively small number r, determine the rth-closest f(w) value to 0
and the rth-closest f(w) value to 1; these will present beta
distributions with parameters r and n-r+1, where n is the number of
f(w) values from among which we chose. It seemed possible that the
selected values, plus the fact that there must be r-1 other f(w)
values more extreme than those chosen, would constitute adequate
information on which to base a decision.
- Comparison of Fisher with naïve Bayesian learning. Scott Lenser has
modified an early version of bogofilter to apply naïve Bayesian
learning. This experiment compares discrimination by his
implementation with that of my implementation of Robinson-Fisher.
- The Bayesian chain rule. The Bayesian chain rule is used in a
project called crm114, and its use is described on that site. This is
yet another comparison experiment, to see how its performance differs
from that of the Robinson-Fisher method.
- Training on error. The bogofilter package can be trained on error
only (only messages that it classifies wrongly or uncertainly are
used in training) or with every available message. This comparison
seems to indicate that with the Robinson-Fisher method either
approach works, though training on error only takes longer to reach
the minimum error rate.
- Training on error, continued. This experiment compares full
training with training on error, with many more messages and more
rounds of training than in the preceding test. It also suggests that
one can save labour, without significantly affecting the error rate,
by first building a fully trained database with about 10,000 spams
and 10,000 nonspams and then switching to training on error for
further maintenance.
- Bogofilter parameters. There is a complex relationship among the
various parameters that govern bogofilter's spam-index calculation.
This experiment explores the effect on error rate of variation in
Robinson's s parameter and in the minimum deviation from 0.5 required
for an individual token's f(w) value to be included in the
calculation (following on from the second experiment in this list).
- Subject tagging. Marking tokens derived from Subject: headers,
e.g. by prepending a flag like subj:, may assist in discrimination,
and this effect, if it exists, may depend on the minimum deviation
(difference from 0.5) required for a token to be included in the
classification calculation. This experiment tests both hypotheses.
- Bogofilter parameters (continued). The relationship of Robinson's s
parameter and the minimum deviation to bogofilter's discrimination
capability is explored further, with much larger numbers of emails
used in training and testing.
- Bogofilter parameters (continued). Yet another, still more
thorough, investigation of the relationship of Robinson's s parameter
and the minimum deviation to bogofilter's discrimination capability.
The effect of very small s values is explained (s should be kept at
or above 0.01 to avoid distorting the spam-score calculation), and
suggestions are given for tuning these parameters.
- Effect of training on bogofilter's optimum parameters. In this
experiment the effect of training on the optimum values of Robinson's
s and the minimum deviation is explored.
- Evaluation of refinements suggested by Paul Graham: preserving
case, tagging tokens from To:, From:, Subject: and Return-Path:
headers, and extracting the contents of HTML A, IMG and FONT tags.
- Bogofilter and repetitive training. Two experiments are presented
that explore the effect of training repeatedly with messages that
bogofilter classifies wrongly or as unsure.
- Bogofilter and repetitive training, continued. Training on error
alone, with or without repetition, is compared with full training and
with training on error after a period of full training.
- Default parameters for bogofilter. An attempt to determine suitable
default parameters for bogofilter.
- The Effective Size Factor. Evaluation of Gary Robinson's suggested
improvement of the chi-squared calculation.
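Since Fisher's method recurs throughout these experiments, here is a
minimal sketch of the Robinson-Fisher calculation in R, following
Robinson's published formulation (the names inv_chisq, rf_indicator,
P and Q are mine):

# Fisher: if n p-values are uniformly distributed, -2*sum(log(p))
# follows a chi-squared distribution with 2n degrees of freedom.
inv_chisq <- function(x, df) pchisq(x, df, lower.tail = FALSE)

rf_indicator <- function(f) {                   # f: a message's f(w) values
  n <- length(f)
  P <- inv_chisq(-2 * sum(log(f)), 2 * n)       # near 1 when tokens look spammy
  Q <- inv_chisq(-2 * sum(log(1 - f)), 2 * n)   # near 1 when tokens look hammy
  (1 + P - Q) / 2   # near 1: spam; near 0: nonspam; near 0.5: uncertain
}

Values near 0.5 are what make the ternary decision possible: one
cutoff above which a message is called spam, another below which it
is called nonspam, and an unsure verdict in between.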
I'll briefly describe some very early work that preceded the systematic
comparisons reported above. The first objective
was to ascertain whether the Robinson method gave better
discrimination, on the same data, than the original Graham
calculations; later, a second objective was added, namely, to see
whether Robinson's g(w) ranking scheme made a further improvement. For
each test I show the date on which it was run, the purpose, the sizes
of the training and test sets, and for each calculation method
attempted, the percentages of false negatives and false positives.
20020918
Graham, without or with a minimum deviation of 0.4, vs Robinson
msgcount_good = 5106
msgcount_bad = 1773
test = training
                 Graham_original   Graham_mindev   Robinson
false negative:  6.65%             3.0%            2.1%
false positive:  0.01%             0.01%           0

20020919
Graham vs Robinson as above
msgcount_good = 5106
msgcount_bad = 1773
test_msgcount_good = 144
test_msgcount_bad = 111
                 Graham_original   Graham_mindev   Robinson
false negative:  8.1%              6.3%            0.9%
false positive:  0.7%              0.7%            0.7%
NOTE: the one false positive detected by all 3 methods was
found to be due to a training artefact; in this test,
all three values should have been zero.

20020820
Graham vs Robinson (f(w)), as above, plus Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test = training
                 Graham_original   Graham_mindev   Robinson   Robinson_g(w)
false negative:  7.7%              6.55%           3.4%       3.6%
false positive:  0.6%              0.6%            0.19%      0.19%

20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 120
test_msgcount_bad = 86
                 Robinson   Robinson_g(w)
false negative:  2.3%       2.3%
false positive:  0          0

20020822
Robinson with f(w) vs Robinson with g(w)
msgcount_good = 5248
msgcount_bad = 2136
test_msgcount_good = 428
test_msgcount_bad = 382
                 Robinson   Robinson_g(w)
false negative:  2.1%       1.8%
false positive:  0          0
From the above data I would conclude that (a) Robinson's f(w) and S
calculation method does give better results with my test sets than the
original method, with or without a deviation threshold; and (b)
Robinson's g(w) was, in these tests, not statistically significantly
more effective than f(w); if any real difference exists, it seems to be too
small to justify the additional computing complexity of g(w).
Users of either of Robinson's methods have access to the option -R,
which tells bogofilter to print a data frame suitable for use with
the statistics package R.
The data frame contains information about each word processed by
bogofilter, as well as the calculation results, so it's quite useful in
investigating misclassifications. Here's an example:
Token pgood pbad fw invfwlog fwlog U
1 glouis 0.986498 0.328654 0.249898 -0.28755 -1.38670 +
2 dynamicro.on.ca 0.233706 0.272581 0.538393 -0.77304 -0.61917 -
3 tue 0.217626 0.214905 0.496854 -0.68687 -0.69946 -
4 dec 0.064441 0.192363 0.749066 -1.38257 -0.28893 +
5 from 1.000000 1.000000 0.500000 -0.69315 -0.69315 -
6 greg 0.647478 0.005455 0.008354 -0.00839 -4.78497 +
7 louis 0.398061 0.006899 0.017036 -0.01718 -4.07244 +
8 gary 0.034246 0.001845 0.051122 -0.05247 -2.97355 +
9 robinson 0.016693 0.000802 0.045851 -0.04694 -3.08236 +
10 grobinson 0.006137 0.000080 0.012902 -0.01299 -4.35036 +
11 transpose.com 0.006137 0.000080 0.012902 -0.01299 -4.35036 +
12 subject 0.999386 0.999037 0.499913 -0.69297 -0.69332 -
13 testing 0.056585 0.013076 0.187703 -0.20789 -1.67289 +
14 farther 0.000982 0.000722 0.423707 -0.55114 -0.85871 -
15 reply-to 0.195164 0.522782 0.728164 -1.30255 -0.31723 +
16 mime-version 0.922671 0.838441 0.476086 -0.64643 -0.74216 -
17 content-type 0.927090 0.946414 0.505157 -0.70352 -0.68289 -
18 text 0.931018 0.947457 0.504376 -0.70194 -0.68443 -
19 plain 0.913097 0.579737 0.388347 -0.49159 -0.94586 +
20 iso-8859-1 0.261200 0.459811 0.637730 -1.01537 -0.44984 +
21 content-disposition 0.157850 0.148965 0.485522 -0.66460 -0.72253 -
22 inline 0.103719 0.121450 0.539373 -0.77517 -0.61735 -
23 organization 0.094882 0.026552 0.218656 -0.24674 -1.52026 +
24 dynamicro 0.011661 0.000562 0.045943 -0.04703 -3.08035 +
25 consulting 0.015711 0.008102 0.340231 -0.41587 -1.07813 +
26 limited 0.175402 0.078774 0.309920 -0.37095 -1.17144 +
27 x-confirm-reading-to 0.057567 0.000000 0.000000 -0.00000 -23.50485 +
28 status 0.967350 0.051500 0.050547 -0.05187 -2.98485 +
29 content-length 0.693384 0.464945 0.401393 -0.51315 -0.91282 -
30 lines 0.702590 0.468956 0.400288 -0.51131 -0.91557 -
31 again 0.152449 0.076047 0.332815 -0.40469 -1.10017 +
32 i'm 0.128391 0.037542 0.226249 -0.25650 -1.48612 +
33 trying 0.054130 0.020055 0.270332 -0.31517 -1.30810 +
34 the 0.868295 0.763035 0.467738 -0.63062 -0.75985 -
35 experiment 0.010556 0.001444 0.120328 -0.12821 -2.11754 +
36 with 0.999755 1.000000 0.500061 -0.69327 -0.69302 -
37 smaller 0.011170 0.003530 0.240119 -0.27459 -1.42662 +
38 corpus 0.002087 0.000160 0.071397 -0.07407 -2.63950 +
39 that 0.591874 0.407829 0.407950 -0.52416 -0.89661 -
40 consists 0.003191 0.002166 0.404292 -0.51800 -0.90562 -
41 spams 0.003069 0.001364 0.307674 -0.36770 -1.17872 +
42 and 0.710446 0.709450 0.499649 -0.69245 -0.69385 -
43 nonspams 0.002332 0.000000 0.000000 -0.00000 -20.29869 +
44 carefully 0.017798 0.017086 0.489802 -0.67296 -0.71375 -
45 minimize 0.003560 0.001364 0.276991 -0.32433 -1.28377 +
46 duplication 0.005033 0.000722 0.125461 -0.13406 -2.07576 +
47 oddities 0.001227 0.000000 0.000000 -0.00000 -19.65683 +
48 like 0.203633 0.272902 0.572680 -0.85022 -0.55743 -
49 uuencode 0.000614 0.000160 0.207240 -0.23223 -1.57388 +
50 will 0.425310 0.377988 0.470545 -0.63591 -0.75386 -
51 report 0.135387 0.080780 0.373691 -0.46791 -0.98433 +
52 later 0.043943 0.040270 0.478192 -0.65046 -0.73774 -
53 today 0.102123 0.195652 0.657046 -1.07016 -0.42000 +
54 gpg 0.186326 0.000401 0.002148 -0.00215 -6.14322 +
55 public 0.238247 0.142066 0.373551 -0.46769 -0.98470 +
56 key 0.124586 0.047249 0.274966 -0.32154 -1.29111 +
57 http 0.516509 0.823761 0.614623 -0.95353 -0.48675 +
58 www.bgl.nu 0.184117 0.000080 0.000436 -0.00044 -7.73901 +
59 finger 0.187799 0.002326 0.012236 -0.01231 -4.40339 +
60 bgl.nu 0.123972 0.002567 0.020286 -0.02049 -3.89782 +
61 P_Q_S_invs_logs_md 1.00e-00 0.00e+00 2.04e-09 -18.098 -150.810 0.10
Loading the above table into R and getting a histogram is as easy as
bogo <- read.table("bogo2.tbl")      # the first column becomes the row names
attach(bogo)
hist(fw[-length(fw)], breaks = 20)   # the negative index drops the last row
We have to exclude the last line from the histogram, since it's a
summary line and the value in the fw column is actually S, the
calculated index used to classify the message as spam or nonspam (a
value greater than 0.54 indicates spam).
From the histogram it's clear that this message contains
(unsurprisingly) a lot of words that don't contribute much to the
decision either way, but there are very few words with f(w) values
on the right (spam) side of the graph. There's a cluster of nonspam
values around 0.22 and another at 0.0; these suffice to identify the
message as nonspam (S=2e-09, very close to zero and far below cutoff
value 0.99).
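The same data frame also makes it easy to see which tokens drove the
decision; for instance, to list the ten tokens whose f(w) values lie
farthest from the neutral 0.5 (again excluding the summary row):

words <- bogo[-nrow(bogo), ]   # drop the summary row
dev <- abs(words$fw - 0.5)     # each token's deviation from neutrality
head(words[order(dev, decreasing = TRUE), c("Token", "fw")], 10)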
Last modified 2004-03-24 -- feedback to Greg