USING SpamAssassin WITH WIN32

Draft 40, Michael Bell guineverehelp|AT|yahoo.com (despammed address)

Last Update: November 8, 2005

(c) 2002-2005 by Michael Bell, distribution freely permitted (see below)

Original Document: http://www.openhandhome.com/howtosa.html
This domain is moving. Until it has moved, try here if the above link fails.
(Check this URL for updates, as any copy you printed out or received in a file may be outdated.)

Note 1: This version of the HOWTO is for SpamAssassin 3.1.0! If you wish to install an older version, please see these links to older documents, which will no longer be updated. To download older versions of SpamAssassin, check out here.

Note 2: Upgraders (even those from SpamAssassin 2.5, 2.6, or 3.0): Read and follow the docs carefully. The SpamAssassin 3.1.0 installation procedure is 98% the same, but the differences could make your day long and tedious. I'll try to point out the differences. In particular:

  1. All upgraders: The DB_File module is now mandatory, at least if you want to use Bayesian analysis (which you do eventually).
  2. All upgraders: Old Bayes databases from 2.5x or from 2.6x must be upgraded or they will not work. Technically SpamAssassin should do this automatically, but it's best to make sure.
  3. All upgraders: Many command line switches and configuration file switches have changed. And auto-whitelisting is on by default.
  4. All upgraders: More modules are optional but recommended. More are required than before too. So you definitely want to read Part III, even if it seems old hat.
  5. All upgraders: More items have become plugins, which means they may or may not be on by default. We'll discuss this.
  6. All upgraders: The rule files have moved. The core files are now stored in \perl\site\share\spamassassin. Most likely you don't care about those unless you edited them. The user configuration files such as LOCAL.CF have changed their location many times. Before copying any old CF, exercise caution, as many switches and rules have changed syntax. Special 3.1.0 note: Although they have moved, SpamAssassin 3.1.0 has a bug and looks for them in the wrong place. So read the docs!
  7. Upgrading from 2.5x: The SA-LEARN command syntax has changed (for the better).

Note 3: Perl 5.8.7 seems fine with SpamAssassin. This is the only build I've tested with version 3.1.0.

Note 4: If you are a neophyte (e.g. new to SpamAssassin and/or Perl), I'd advise reading Parts I-XI once through. Then glance at the rest. You do eventually want to understand Bayesian analysis - it is important. But it's not necessary to get scared the first time. Honest, I know people who have NEVER used Perl that have read through these notes and fairly easily gotten SpamAssassin up and running. Relax, take it slow, maybe play on a test machine and manually run some saved MIME spam and ham through.

What is SpamAssassin?

SpamAssassin is a wonderful open source product that performs heuristic spam analysis and RBL lookups, among other tests, to allow you to block most spam mail.

In its default form, it is designed and written for Unix platforms. This document provides information on how to get SpamAssassin working on Win32. This has steadily been getting easier and easier, thanks to the basic wonderfulness of the core SpamAssassin developers <g>.

Please note the following:

  1. SpamAssassin is open source software. SpamAssassin 3 is licensed under the Apache Software Foundation version 2 license. No guarantees or warranties apply to the software. It's used entirely at your own risk. No support, though if you have an intelligent question, you can ask it at the forums off the main SpamAssassin site. (Please don't abuse this! Platform-specific questions do not generally belong here!)
  2. Similarly no guarantees/warranties are made that the following steps will work. You are taking full responsibility for any damages, delays, negative consequences, etc. that may occur to you, your computer system, and your company, if you proceed.
  3. I provide only a little specific information on how to integrate SpamAssassin into any Win32 mail system. Mostly, that is ENTIRELY up to you and I will provide no assistance with this, although an experienced system administrator should be able to do so. However,
    • I do provide a separate document on how to integrate my own commercial app, Guinevere, which is a product that snaps into Novell's GroupWise messaging system. It's part of the User Manual. I also provide a few usage notes below.
    • Some free scripts are available for Outlook. See here for more information.
    • Several POP3 proxies are also available and referenced below.
    • Those whacky guys at IBM Development have come up with an integration for Lotus Notes. This is referenced in a later section.
  4. If you come up with any corrections, fixes, modifications, additions, it would be GREATLY appreciated if you provide them free of charge for the whole community, preferably under the Apache Software Foundation version 2 license. That (at least in my personal opinion) is the spirit of open source software. Contributions to SPAMD, and DCC would especially be welcomed!
  5. If you distribute this document (which is fine), I require that you do NOT remove the above URL linking to the original HTML document, or my authoring credit -- and that you do NOT charge for providing this document (although you may feel free to charge for consulting time implementing SpamAssassin for your clients of course and you MAY include a copy of this document in a commercial distribution.)

The Builds That Were Used

These were tested on Windows XP SP2, and should function fine on NT, 2000, 2003 as well.

Multiple versions of Perl 5.6.1 and 5.8x have been released previously and may not contain all the necessary modules. Early 5.8x releases in particular are likely to have issues with SpamAssassin.

Use Windows NT/2K/XP/2003 ONLY, preferably with NTFS volumes. It IS possible to get it running on Windows 95/98/ME (or at least I managed to way back in the SpamAssassin 2.2/2.3 days), but Perl acts unreliably on such platforms. During install you will find all kinds of weirdness, the primary one being that every call to PL2BAT fails. You'd have to run these all manually (not a big deal) and hit CTRL-Z every time the install "hangs" with NMAKE. Many specific Perl features are only partially supported on these platforms, which can lead to failure or bizarre behavior. Trust me. It isn't worth it.

Specific Win32 Limitations

There are some problems with running SpamAssassin in a pure Win32 environment, largely related to its Unix roots. Primarily, there are two Perl functions that don't work well on Win32 - fork and alarm - and both are needed to properly resolve the problems listed below. While Perl 5.8.4 and above makes a noble stab at this, providing "sort of working" versions of fork and alarm, these are insufficient for production use.

Hence (links are provided below for more technical info if desired):

  1. SpamD doesn't run on Win32, so SpamAssassin must be run in "serial mode". See this section for more.
  2. Pyzor is unreliable.
  3. Razor 2 currently doesn't work.
  4. DCC currently doesn't work.
  5. DomainKeys, an optional plugin, won't work without lots and lots of effort.
  6. The Report Spam feature (-r switch) doesn't work because of ALARM. As this is only relevant to users of DCC, Pyzor, or Razor, it has zero priority right now :)

But it does work very well within these parameters!

Part Zero: Upgrading from an older version

Here's what I recommend if you are upgrading from an older version of SpamAssassin, assuming all default installation paths. If you've never installed SpamAssassin before, you can just skip this section for now, and read it later.

OK, this is an important upgrade note for all versions. SpamAssassin 3.1.0 now installs the core rules to \perl\site\share\spamassassin and the user rules to \perl\site\etc\mail\spamassassin. However, 3.1.0 has a bug. It ACTUALLY looks for these rule directories in \usr\share\spamassassin and \etc\mail\spamassassin respectively. So at the end of the install, we'll copy things there, and that location is where things should be altered until this bug is fixed. The CF files are not very compatible with older versions anyway.

  1. Back up your customized local.cf and other templates. They will be lost shortly. In fact, wise paranoids back up everything!
  2. Delete these directories, if they exist: c:\etc\mail\spamassassin, c:\perl\etc\mail\spamassassin.
  3. Delete these directories, if they exist: c:\perl\share\spamassassin, c:\perl\site\share\spamassassin, c:\perl\site\etc\mail\spamassassin.
  4. Delete the c:\perl\site\lib\mail\spamassassin directory and the c:\perl\site\lib\mail\spamassassin.pm file.
  5. Delete the SpamAssassin related files (spamassassin, sa-learn, sa-update, spamc, spamd) from c:\perl\bin
  6. The Bayes and AutoWhiteList databases are in %USERPROFILE%\.spamassassin. Normally, you want to preserve them and thus would leave these alone. But if you're doing a clean install from scratch, whack this directory too.
  7. Do the install, remembering to alter the batch files to include the SET RES_NAMESERVERS bit. Net::DNS is still not 100% reliable in auto-identifying your DNS servers.
  8. Copy back templates. I recommend that you only copy back the local.cf, etc., not the rulesets, which will conflict with later versions of SpamAssassin. Copy them to \perl\site\etc\mail\spamassassin.
  9. If you were a SpamAssassin 2.5x/2.6x user running Bayes, see these notes. You must upgrade your old Bayes database (or throw it away) or it may not work.
  10. Run spamassassin --lint to check your local.cf and third party cf files for unrecognized configuration options. Don't be surprised if a few errors show up.
  11. Test!
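
Steps 2 through 4 above boil down to checking a fixed list of leftover locations. Here is a minimal Python sketch of that check (Python is not part of this HOWTO's toolchain, and the names find_leftovers and root are mine, purely for illustration). It only reports what it finds and deletes nothing.

```python
import os

# Locations named in steps 2-4 above, relative to the drive root.
# Written with "/" so the sketch also runs on non-Windows test machines.
OLD_LOCATIONS = [
    "etc/mail/spamassassin",
    "perl/etc/mail/spamassassin",
    "perl/share/spamassassin",
    "perl/site/share/spamassassin",
    "perl/site/etc/mail/spamassassin",
    "perl/site/lib/mail/spamassassin",
    "perl/site/lib/mail/spamassassin.pm",
]


def find_leftovers(root="C:\\"):
    """Return the old SpamAssassin paths that still exist under root."""
    found = []
    for rel in OLD_LOCATIONS:
        path = os.path.join(root, *rel.split("/"))
        if os.path.exists(path):
            found.append(path)
    return found
```

Calling find_leftovers() with no argument checks drive C:, which assumes all default installation paths as the rest of this section does. Back up first (step 1), then remove whatever it lists by hand.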

Major compatibility issues upgrading from older versions:

The latest version of SAConf has been updated to deal with the CF switch changes and convert them correctly. See the SAConf section.

PART I: Installing Perl

Note: If upgrading Perl, uninstall old version - conflicts and issues have been reported when installing Perl over an existing Perl install.

  1. This is very easy. Go to http://www.activestate.com and select the ActivePerl download. Choose the MSI installer version. Grab Perl 5.8.7. If you use versions or builds other than the ones described you may be sorry. Or you may not be.

    NOTE: If you have to use the NON-MSI version, that's fine. Download it, unzip it and run INSTALLER.BAT. It mostly works the same. You'll have to manually add C:\PERL\BIN to your PATH though.

  2. Double click the MSI file and run it. All of the default options are fine (though you may NOT want to enable the ISAPI integration into IIS, if you aren't going to use Perl on your web server). Don't worry if the Perl installer seems to take a really long time. Wait at least 10 minutes before worrying. (The ActivePerl installer takes a long time to generate the HTML docs.)
  3. Open a DOS box and type PERL -V to verify all is well.
  4. In subsequent sections, it will be assumed that Perl was installed in C:\PERL. Make appropriate changes if necessary.

You shouldn't have to upgrade Perl too often - maybe once every year or two. Always check Part III to see if new module requirements might make this necessary.

PART II: Installing NMAKE

  1. Why? Because we need NMAKE to build Perl modules.
  2. You can obtain NMAKE from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe
  3. Extract the files, and place them in C:\PERL\BIN . Both NMAKE.EXE and NMAKE.ERR are needed.

Note: An alternative link to NMAKE is Q132084 on MS's knowledgebase. I'm mentioning this because MS's ftp site seems to go down about once a month for a day or two.

PART III: Installing the Necessary Modules

Perl uses modules to extend the language's capabilities. Many of them come with the core distribution, and many more are available.

SpamAssassin requires several modules not in the core distribution of ActivePerl. Other modules that are included in the distribution still need to be updated.

There are 3 basic ways to install new modules in the Perl world: manual, PPM, CPAN. The latter two are generally easier than the manual method. In general, I prefer installing them using PPM (Perl Package Manager), rather than CPAN (which experienced PERL users are likely familiar with) when possible, as it downloads the precompiled versions for Win32. It's also a lot less confusing for the novice. However, the one disadvantage of PPM is not everything is as up to date as manual or CPAN downloads.

Let's go through the PPM and manual methods, after which I'll present a table of all the core modules and my recommendations.

Method I: Using PPM

The basic method of operation is pretty straightforward. Note PPM is sometimes case-sensitive.

  1. To run PPM, you open a DOS box and type PPM3 (upgraders, note this has changed! We used to type just PPM, which runs the older PPM. I prefer to use PPM3 now and some newer modules may require it.)
  2. To find out what modules PPM thinks are installed type query *. This will list the packages and the versions. PPM only lists modules that were installed when Perl was first installed and modules then installed by PPM itself. So if you install a module using the manual or CPAN methods, PPM does not know about it, does not display it, and may well do something annoying like overwrite it.
  3. To install a module, normally you just type install <modulename>. PPM can search more than one "repository" (web or hard drive location) for matching modules. If more than one "repository" contains conflicting versions, no install will occur. Instead, they'll be listed with numbers, and you'll need to type, for example, install <modulename> 3 to install the third choice.
  4. To uninstall a module, you type uninstall <modulename>.
  5. To search the repositories for a module, type search <pattern> (such as Net*). This is case-sensitive.
  6. When done, type quit.
Geek Note: PPM connects to the repository via TCP port 80. Firewall people take note! If you have difficulties installing from PPM, you may need to authenticate to your proxy server. In this case, you need to add the following environment variables:

HTTP_proxy=http://your.proxy.com
HTTP_proxy_user=yourid
HTTP_proxy_pass=yourpasswd

Method II: The manual way

For most modules you can avoid this, which is good, as it's a bit intimidating at times. The basic procedure is still worth understanding, because the main SpamAssassin install (SpamAssassin is itself a module) goes through the same process.

Open a DOS box.

  1. Download the module and unzip it.
  2. In the DOS box, go to the subdirectory containing the downloaded, unzipped module.
  3. Type in order (with a carriage return after each command):
PERL MAKEFILE.PL
NMAKE
NMAKE INSTALL

You may be prompted for various bits of information during the makefile.pl step.

This looks more impressive than it is. The first line runs a Perl program that creates the final "makefile". It also checks for dependencies (other modules that this module requires) and warns you about them. NMAKE then builds the module by following that makefile, and NMAKE INSTALL follows up by copying the modules to the PERL subdirectories. Now all Perl programs can use these modules.

More experienced Perl users run another step - NMAKE TEST between NMAKE and NMAKE INSTALL. This is not required - and can be unnecessarily discouraging in its results since failures are not uncommon.
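
If you install several modules by hand, the three-command sequence above is easy to wrap. The following Python sketch is just one way to do it, under my own assumptions: the names STEPS and build_module are invented here, the makefile script is spelled Makefile.PL (Windows does not care about the case), and the run parameter exists only so you can substitute a dry-run function.

```python
import subprocess

# The three commands from the manual method, in order.
STEPS = (
    ["perl", "Makefile.PL"],
    ["nmake"],
    ["nmake", "install"],
)


def build_module(source_dir, run=subprocess.call):
    """Run each step inside source_dir and stop at the first failure.

    Returns the list of commands that were run. The run argument must
    accept (command, cwd=...) and return the exit code.
    """
    done = []
    for cmd in STEPS:
        done.append(list(cmd))
        if run(cmd, cwd=source_dir) != 0:
            raise RuntimeError("step failed: " + " ".join(cmd))
    return done
```

Stopping at the first non-zero exit code matters: running NMAKE INSTALL after a failed NMAKE tends to copy a half-built module into your Perl tree. Note that any prompts from the makefile step still appear in the console as usual.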

Now to install the modules!

Please read this entire section before proceeding. Install each module in the order shown.

When PPM is referenced below, this means use PPM3 and follow the PPM method above. When manual is referenced below, a hyperlink is provided to a zip file of the module. Unzip it to a temporary directory and follow the manual method.

The table below should be your guide.

Module Name | 5.8.7 includes | Upgrade using | Comments
HTML-Parser | 3.45 | not needed | Older builds of both Perl 5.6 and 5.8 will probably not have a new enough version (check with QUERY in PPM). At least 3.24 is required.
Digest-SHA1 | 2.10 | not needed | Older builds of both Perl 5.6 and 5.8 may need to be upgraded via PPM. Check using QUERY in PPM. This module is also required.
DB_File | - | PPM | On Perl 5.8.7, PPM will install version 1.812. This is technically optional, but since it's required for Bayesian functions, and those are quite important for full functioning of SpamAssassin 3, it isn't really.
Net-DNS | - | PPM or manual (see comments) | On Perl 5.8.7 PPM will install version .53, which appears to work correctly on Win32 systems. At the same time, Net-IP 1.24 is installed. On Perl 5.6.1, however, version .42 will be installed by PPM, which DEFINITELY DOESN'T always work correctly on Win32. Many versions do not (most .2x and .3x versions, .49, .50). I recommend that if you are running Perl 5.6.1, or for some reason are having trouble getting version .48 to install with PPM, you manually download and install the Net-DNS module at this point, which will overwrite any previously installed Net-DNS. For later versions, you might also have to install Net-IP. The moral of all this: check your version of Net-DNS (type 'query Net-DNS') before leaving PPM.
Time-HiRes | - | PPM | PPM will install v1.49 for Perl 5.8.7. Optional, and not terribly important.
IP-Country | - | PPM | PPM will install v2.20 for Perl 5.8.4. Optional, used for the Country Relay plugin.
Mail-SPF-Query | - | PPM | This module is optional and used only for SPF queries (SPF is only of variable use currently IMO, but many may disagree). PPM will install 1.997 for 5.8.7, and also install Net-CIDR-Lite .18 and Sys-HostName-Long 1.4 at the same time. Note that this module requires Net-DNS .34 and above and will automatically install Net-DNS via PPM. This is fine on Perl 5.8.7, where Net-DNS .53 is installed by PPM. But if you are running Perl 5.6.1, you probably just installed Net-DNS .42 by accident, which would be bad, as this version doesn't work properly. It is probably better to get Net-DNS up and running first.
DBI | - | PPM | On Perl 5.8.7, installs DBI 1.48. You do not need this module unless you are also installing the appropriate DBD for your database system, usage of which is beyond the scope of this manual.
IO-Zlib | 1.4 | not needed | Used if you use sa-update.
Archive-Tar | 1.23 | not needed | Used if you use sa-update.
Pyzor, Razor, DCC | - | - | Don't bother. You probably can't get them running on Win32, and they are either overrated or non open source. Or both.
Net-SMTP, Net-Ident, IO-Socket-INET6, IO-Socket-SSL | - | - | Don't bother. These are used with spamd, which doesn't run on Win32.
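
The Net-DNS comments in the table amount to a small lookup: given the version that QUERY reports, is it one I've flagged as troublesome on Win32? A Python sketch of that check follows. The function name is invented, and treating every .2x and .3x release as suspect is my simplification of "most .2x and .3x versions", so read the result as a hint, not a verdict.

```python
from decimal import Decimal

# Versions called out individually in the Net-DNS comments.
KNOWN_BAD = {Decimal("0.42"), Decimal("0.49"), Decimal("0.50")}


def net_dns_is_suspect(version):
    """Return True if this Net-DNS version is flagged in the comments.

    version is the string PPM shows, such as ".53" or "0.42".
    Assumption: the whole .20 to .39 range is treated as suspect.
    """
    v = Decimal(version)
    if v in KNOWN_BAD:
        return True
    return Decimal("0.20") <= v < Decimal("0.40")
```

Decimal is used instead of float so that ".50" and "0.5" compare as the same version without rounding surprises.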

Part IV: Obtaining SpamAssassin

Go to http://spamassassin.apache.org, choose Download, and get the ZIP file distribution.

Use WinZip or another ZIP extractor, and extract the ZIP file off the root. For SpamAssassin 2.41, for example, this will create Mail-SpamAssassin-2.41 off C:\.

Note: I'll refer to this directory as the SPAMSOURCE directory in subsequent sections. After you finish Part VIII, you may delete SPAMSOURCE if desired - it is only needed to reinstall SpamAssassin.

PART V: Pre-Install

Open a DOS box, go to the SPAMSOURCE directory and type

PERL makefile.pl

You will be asked several questions. The first one is the crucial one. You should answer NO unless you have a C compiler installed and want to build spamc (you probably don't). Note it defaults to yes, which will likely give you a frightening and fatal error during NMAKE:

Build spamc.exe (environment must be set up for C compiler)? (y/n) [y] n

The next question is the e-mail contact info for the administrator. You should answer truthfully. The last few ask about types of tests to be run if you run NMAKE TEST (which we're not going to do). You probably want to answer No to these.

After this, SpamAssassin will kindly tell you about many recommended and required modules you may have missed. Use Part III to judge if this really matters.

If you make a mistake, just run PERL makefile.pl again.

PART VI: Getting NMAKE to Work, Part I

This may come as a shock to pre-2.50 SpamAssassin users on Win32, but this is now effortless. From a DOS box in the SPAMSOURCE directory, just type:

NMAKE

at the command prompt.

PART VII: Getting NMAKE to Work, Part II

With Perl modules, you go through 3 separate runs of NMAKE: NMAKE alone (which we discussed), NMAKE TEST (which is an optional exercise and will often fail for reasons too tedious to go into here), and NMAKE INSTALL (which actually installs everything).

Here we get NMAKE INSTALL to work properly. Note this section is much much shorter than previous HOWTOs - it works better!

From a DOS box in the SPAMSOURCE directory, type:

NMAKE INSTALL

Perl aficionados will note that we've skipped the NMAKE TEST phase. This can be done if you want, but it isn't 100% necessary, and takes quite a long time. Also, some of the errors are confusing or misleading. Feel free to run it once though.

PART VIII: A Few Tasks Before We're Done

  1. Critical: SpamAssassin 3.1.0 has introduced another bug. It installs the core rules to c:\perl\site\share\spamassassin and the user files to c:\perl\site\etc\mail\spamassassin, but when it runs, it actually looks for these in c:\usr\share\spamassassin and c:\etc\mail\spamassassin. So copy them there. And modify rules in THAT location for now, or double-maintain until this is fixed.
  2. Critical: Next, find \perl\bin\spamassassin.bat (it is probably read-only, which will cause you grief in a second), and add the following at the beginning (well, nearly: right after the @ECHO OFF line):
    SET RES_NAMESERVERS=ipaddress
    SET LANG=en_US

where ipaddress is the ipaddress of your DNS server. If you have more than one, add additional ones, separating with a space character. This is needed for all RBL lookups to function properly. (Net-DNS in theory can query Windows for the right nameserver, but has not demonstrated reliability in this matter.)

You should make similar changes to \perl\bin\sa-learn.bat if you plan on using the Bayesian spam functionality.

  3. Critical: Users have reported problems when Razor, Pyzor, or DCC are enabled. By default, if you open c:\etc\mail\spamassassin\init.pre, you'll see they are all commented out with # marks, except for Pyzor. Stick a # in front of the loadplugin pyzor line.
  4. (I used to have a batch file for generating HTML docs, but this becomes tiresome. They are all there for you at http://spamassassin.apache.org/full/3.1.x/dist/doc/ )
  5. Critical: OK, we will try a test. From the SPAMSOURCE directory, type
    spamassassin -D < sample-spam.txt
  6. We'll do more exhaustive testing in a second, but you should have NO errors.
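
The batch-file edit in this part (adding the SET lines right after @ECHO OFF) has to be repeated for sa-learn.bat and again after every reinstall, so it may be worth scripting. Below is a rough Python sketch of the text transformation only. The function name and the example addresses are mine; it assumes the file has already been read into a string and that you have cleared the read-only attribute yourself before writing the result back.

```python
def add_resolver_lines(batch_text, nameservers, lang="en_US"):
    """Insert the two SET lines right after the first @ECHO OFF line.

    nameservers is a list of DNS server IP addresses; they are joined
    with spaces, as described above. If a SET RES_NAMESERVERS line is
    already present, the text is returned unchanged.
    """
    lines = batch_text.splitlines()
    if any(l.strip().upper().startswith("SET RES_NAMESERVERS=") for l in lines):
        return batch_text
    new_lines = [
        "SET RES_NAMESERVERS=" + " ".join(nameservers),
        "SET LANG=" + lang,
    ]
    for i, line in enumerate(lines):
        if line.strip().upper() == "@ECHO OFF":
            lines[i + 1:i + 1] = new_lines
            break
    else:
        # No @ECHO OFF found: fall back to the very top of the file.
        lines[0:0] = new_lines
    # Batch files conventionally use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"
```

For example, add_resolver_lines(text, ["192.168.1.1", "192.168.1.2"]) produces a SET RES_NAMESERVERS line listing both (made-up) servers. Because the function leaves already-patched text alone, running it twice is harmless.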

PART IX: Testing SpamAssassin

  1. Create a temporary directory called TEST off the root. (C:\TEST)
  2. Copy the sample-spam.txt and sample-nonspam.txt files from the SPAMSOURCE directory.
  3. Open a DOS box and go to C:\TEST
  4. Type
    spamassassin -D -t < sample-nonspam.txt > nospam.txt

This should run fine, and when you look at nospam.txt the report should be hunky dory. See the next section for comments about the -t switch before you e-mail me!

  5. Type
    spamassassin -D -t < sample-spam.txt > spam.txt 

This should run fine and the report should label this as spam. Note the RBL check should register as spam as well. If no RBL check catches this spam, verify you set the DNS servers as in Part VIII. There are also some test scripts supplied in Part X below.
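
If you end up re-running these two tests often, you can let a script read the verdict out of nospam.txt and spam.txt instead of eyeballing them. The Python sketch below assumes the output carries an X-Spam-Status header of the form "Yes, score=... required=..." (older releases wrote hits= instead of score=). The function name is invented, and if your configuration rewrites headers differently the pattern will need adjusting.

```python
import re

# Matches the first line of the X-Spam-Status header.
STATUS_RE = re.compile(
    r"^X-Spam-Status:\s*(Yes|No),\s*(?:score|hits)=(-?\d+(?:\.\d+)?)"
    r"\s+required=(-?\d+(?:\.\d+)?)",
    re.MULTILINE | re.IGNORECASE,
)


def read_verdict(report_text):
    """Return (is_spam, score, required), or None if no header is found."""
    match = STATUS_RE.search(report_text)
    if match is None:
        return None
    is_spam = match.group(1).lower() == "yes"
    return is_spam, float(match.group(2)), float(match.group(3))
```

Read each output file into a string and pass it in. The spam sample should come back with is_spam set to True and the nonspam sample with False. A result of None means the header was not found at all, which usually points to a problem earlier in the run.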

PART X: Using and Configuring SpamAssassin

Here are some examples (none of them recommended by default - this is for purposes of illustration)

# Sample LOCAL.CF # 
#
# This one says if it's FROM our friend Bill, it CAN'T be spam! 
whitelist_from billg@microsoft.com 
# This one changes the default of MAPS (off, because you can't use it for free, to on) 
score RCVD_IN_RBL 1.0 
# Mail that is pure HTML is normally given a score of 4.5 (and since the default
# threshold to mark as spam is 5, this could be a problem). We change that below
score CTYPE_JUST_HTML 2.0
# And that spam threshold of 5 is too harsh and flags stuff as spam incorrectly 
# too often! Let's make it 7.5! 
required_hits 7.5 
# We like Russians. They use BASE64 text all the time, because of their 
# whacky character set, and SpamAssassin marks this as likely spam (score=3.2). 
# Let's change that score 
score BASE64_ENC_TEXT 0.00 
#Razor, DCC, Pyzor don't work well on Win32. Let's turn them off
use_razor2 0
use_pyzor 0
use_dcc 0
# Bayesian analysis is great, but the auto-learner slows down things a bit. I'd rather
# manually train the Bayes analyzer.
bayes_auto_learn 0
# SpamAssassin always wastes some time testing whether DNS is available. This means 3 wasteful
# DNS lookups per message. I maintain my own DNS server, and it's very reliable, so let's disable
# this test
dns_available yes
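
To see how the score and threshold lines in the sample interact, here is the arithmetic as a small Python sketch. It assumes a message is tagged as spam once the sum of its rule scores reaches the threshold. OTHER_RULE and its score of 1.0 are invented purely for the example, as is the function name.

```python
def classify(hit_rules, scores, required):
    """Add up the scores of the rules a message hit and compare the
    total against the spam threshold. Returns (total, is_spam)."""
    total = sum(scores[name] for name in hit_rules)
    return total, total >= required


# Scores before and after the overrides in the sample LOCAL.CF.
default_scores = {"CTYPE_JUST_HTML": 4.5, "OTHER_RULE": 1.0}
tuned_scores = {"CTYPE_JUST_HTML": 2.0, "OTHER_RULE": 1.0}
hits = ["CTYPE_JUST_HTML", "OTHER_RULE"]

print(classify(hits, default_scores, 5.0))  # (5.5, True)
print(classify(hits, tuned_scores, 7.5))    # (3.0, False)
```

With the stock values, a pure-HTML message needs only one more minor hit to cross the line. After lowering that rule's score and raising the threshold, the same message sits well below it.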

You might find this program helpful.

PART XI: Windows GUI for Configuring SpamAssassin

For us Windows GUI slaves, the SpamAssassin config files can be a little confusing... so take a look here, where I've provided just such a GUI.

PART XII: More Rule Sets

If you want to get more rules than those included in the SpamAssassin distribution, look at http://www.rulesemporium.com

To install, copy these files into the same location as your local.cf file, usually \perl\site\etc\mail\spamassassin

These rules may prove to be slow, unreliable, memory hogging, and full of false positives. They are not necessarily tested nearly as carefully as the core rules. Most of them are great - I'm not trying to denigrate their wonderful efforts, just point out that blindly applying these would be foolish. Many may also be optimized for, or run only with, an older version of SpamAssassin! Use at your own risk and be prepared to remove them!

As of version 3.1.0, SpamAssassin ships with its own updater module, sa-update. Since nothing has been published as an update yet, it's hard to comment on the quality thereof.

Part XIII: Bayesian Spam Analysis

New users of SpamAssassin are advised to skip this section the first time or two they read this HOWTO and return to it a little later when they feel more comfortable. This is important stuff to learn eventually, though.

For SpamAssassin 2.5 or 2.6 users that built a Bayes database, this next bit is critical. For anyone else, this sidebar has no interest whatsoever. I've plagiarized and rewritten pieces from the mailing list and the UPGRADE file :)

SpamAssassin 3 has a new Bayes backend and database format. Your old database(s) will automatically be upgraded the first time SpamAssassin 3 tries to write to the DB, and any journal, if it exists, will be wiped out without being synced. In addition, support for Bayes databases in formats other than DB_File has been dropped, due to a large number of serious issues (including crash and concurrency bugs). So, what you want to do is something like this:

  1. Stop running spamassassin/spamd (ie: you don't want it to be running during the upgrade) and backup your bayesian database.
  2. Run sa-learn --rebuild (which syncs your journal).
  3. Upgrade SpamAssassin, using the instructions below.
  4. Install DB_File module, if not installed already.
  5. Run sa-learn --sync followed by sa-learn -D --import to migrate the data into new DB_File format.
  6. Start running spamassassin/spamd again

For a lengthy description of how Bayesian Spam Analysis works, go to http://www.paulgraham.com and see "A Plan for Spam". It's reasonably readable, even if statistics make me break out in hives.

The short semi-inaccurate version: Given training, a spam heuristics engine can take the most "spammy" and "nonspammy" words and apply probabilistic analysis. Furthermore, once given a basis for the analysis, the engine can continue to learn iteratively by applying both its non-Bayesian and Bayesian rulesets together to create evolving "intelligence".
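
For the curious, here is a toy version of that idea in Python, loosely in the spirit of "A Plan for Spam". It is NOT how SpamAssassin does it: the tokenizer, the clamping values, the neutral 0.5 for unseen words (the essay uses 0.4) and all the function names are simplifications of mine, meant only to show the shape of the calculation.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens, counted once per message."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


def train(messages):
    """Count, for each token, how many messages it appears in."""
    counts = Counter()
    for msg in messages:
        counts.update(tokens(msg))
    return counts


def token_probability(tok, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token), clamped so no single word is decisive."""
    s = spam_counts[tok] / n_spam
    h = ham_counts[tok] / n_ham
    if s + h == 0:
        return 0.5  # never seen before: treat as neutral
    return min(0.99, max(0.01, s / (s + h)))


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham, top=15):
    """Combine the most telling token probabilities into one number."""
    probs = [token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
             for t in tokens(text)]
    # Keep the tokens farthest from neutral: the "most spammy and nonspammy".
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top]
    spam_side = ham_side = 1.0
    for p in probs:
        spam_side *= p
        ham_side *= 1.0 - p
    return spam_side / (spam_side + ham_side)
```

Train it with train(list_of_spam) and train(list_of_ham), then score a new message. Words seen only in spam push the result toward 1, words seen only in ham push it toward 0, and a message made entirely of unknown words lands at exactly 0.5. That is also why a tiny training set is nearly useless.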

SpamAssassin 2.5 and above support Bayesian Spam Analysis. This is a new feature, quite powerful -- but doesn't come into its own until you build a sufficient database. In fact it will not function at all until you have run through at least one set of data using SA-LEARN (see below).

The pros and cons of Bayesian spam analysis

but ...

Special Configuration Switches

There are several switches in SpamAssassin that you may modify in your local configuration files. These are important to understand. More detailed information is in the SA-LEARN.HTML and CONF.HTML files you generated in SPAMDOCS.

use_bayes 0 | 1 (default: use_bayes 1) - When this is on, SpamAssassin will use the database of Bayes statistical tokens as additional tests to decide if the message is spam or not. If this is off, no bayes operations will take place whatsoever - auto-learning will be off, and all your Bayes tokens will be for naught. It may be a bit faster though, as noted above. If you have problems with Bayes, and decide you hate it, this is the switch to turn everything off with. Note that the Bayesian db will remain unused with regard to SpamAssassin scores until at least 200 spam and 200 nonspam have been auto-learned or manually SA-LEARNed.

bayes_auto_learn 0 | 1 (default: bayes_auto_learn 1) - When this is on, IF a message is marked as spam or non-spam (HAM is the nickname given to nonspam) AND certain thresholds are exceeded, then and only then the message will be dynamically tokenized and added to the database. This allows SpamAssassin to "learn" and reduces the need for constantly having to refresh the corpus every few months.

bayes_auto_learn_threshold_nonspam n (default 0.1) - If a message has a score beneath this number AND bayes_auto_learn is enabled AND use_bayes is enabled, the message will be automatically tokenized and learnt as nonspam.

bayes_auto_learn_threshold_spam n (default 12.0) - If a message has a score above this number AND bayes_auto_learn is enabled AND use_bayes is enabled, the message will be automatically tokenized and learnt as spam.
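
Put together, the four switches above describe a small decision rule. Here it is as a Python sketch, modelling only what is stated in this section. The function names are mine, the thresholds are passed in explicitly so you can use whatever your configuration sets, and the real engine may well apply further conditions that are not shown here.

```python
def auto_learn_decision(score, use_bayes, bayes_auto_learn,
                        nonspam_threshold, spam_threshold):
    """Return "spam", "ham", or None (meaning: do not auto-learn)."""
    if not (use_bayes and bayes_auto_learn):
        return None
    if score > spam_threshold:
        return "spam"
    if score < nonspam_threshold:
        return "ham"
    return None


def bayes_scoring_active(n_spam_learned, n_ham_learned, minimum=200):
    """Bayes results only count once enough of BOTH kinds are learned."""
    return n_spam_learned >= minimum and n_ham_learned >= minimum
```

The point to notice is the gap between the two thresholds: messages with middling scores are never auto-learned either way, which is what keeps borderline mistakes from polluting the database.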

Many configuration switches exist for controlling the maximum and minimum size of the Bayes database, journal synching, etc. In the interest of clarity, I'm punting and telling you to read the official documentation. Search for "bayes_expiry", and you should find all the interesting switches thereof clustered in that section.

The SA-LEARN Program

Purpose

The SA-LEARN tool, run from the command-line is the biggie. You use this tool for the following purposes:

Important Note: If you do not run SA-LEARN at least once on a set of spam/nonspam, the bayes database will NEVER be created, and the Bayesian algorithms will never kick in.

SA-LEARN Syntax

I've shamelessly stolen details from the SA-LEARN.HTML documentation and placed them below. I've removed quite a few options in the interest of clarity and abbreviated the descriptions - do read the documentation when you have a chance!

For SA 2.5 upgraders: The syntax has changed a bit. It's easier actually - it auto detects if you are feeding a file or folder, so

sa-learn --ham c:\test\ham

works beautifully on a folder called ham just as

sa-learn --ham c:\test\sample-nonspam.txt

works beautifully on one file. There is no longer a need to use the --file and --dir options.

sa-learn [options] <directory or filename>

Options:

--ham
Learn the input message(s) as ham.
--spam
Learn the input message(s) as spam.
--sync
Sync the database with the journal, typically done after learning with --no-sync.
--dump
Dumps the contents of the Bayes databases to the screen, good for quick sanity checks.
--forget
Forget a given message previously learned.
-a, --auto-whitelist
Use auto-whitelists. While learning, add addresses to the auto-whitelist as appropriate.
-D, --debug-level
Produce diagnostic output. Strongly recommended if you have any Bayes issues.
--no-sync
Skip the slow rebuilding step which normally takes place after changing database entries. If you plan to import many files of spam and ham in a batch, it is faster to use this switch and run sa-learn --sync once all the folders have been scanned.
--import
If you previously used SpamAssassin's Bayesian learner without the DB_File module installed, this switch migrates that old data into the DB_File format. Also converts DB versions in general.

Example

sa-learn --spam c:\mysamples\spam

will tell SpamAssassin to go through the contents of the c:\mysamples\spam directory, and add all of the messages as SPAM.
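For a big initial import, the --no-sync and --sync switches described above can be combined: learn each folder without syncing, then sync once at the end. Here is a minimal Python sketch of that sequence. The function names and the c:\mysamples\ham folder are made up for illustration, and it assumes sa-learn can be found on your PATH (on Win32 you may need to give the full path to SA-LEARN.BAT instead).

```python
import subprocess


def build_learn_commands(ham_dir, spam_dir, sa_learn="sa-learn"):
    """Return the sa-learn command lines for a batch import, in run order."""
    return [
        [sa_learn, "--no-sync", "--ham", ham_dir],    # learn ham, skip the rebuild
        [sa_learn, "--no-sync", "--spam", spam_dir],  # learn spam, skip the rebuild
        [sa_learn, "--sync"],                         # one sync at the very end
    ]


def run_batch(ham_dir, spam_dir, sa_learn="sa-learn"):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in build_learn_commands(ham_dir, spam_dir, sa_learn):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_batch() once the paths look right.
    for cmd in build_learn_commands(r"c:\mysamples\ham", r"c:\mysamples\spam"):
        print(" ".join(cmd))
```

Printing first and running second is deliberate: it lets you eyeball the folders before anything touches the Bayes database.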

Basic Guidelines and Caveats

  1. Start by building a significant sample of both spam and non-spam (The SpamAssassin program calls non-spam "HAM"). I suggest several thousand of each, placed in SPAM and HAM directories. Yes, you MUST hand-sort this - otherwise the results won't be much better than SpamAssassin on its own. Verify the spamminess/hamminess of EVERY message. I urge you to avoid using a publicly available corpus (sample) - this must be taken from YOUR mail server, if it's to be statistically useful. Otherwise, the results may be pretty skewed. Yes, they should sort themselves out after a while, but....

  2. Use SA-LEARN with the appropriate --spam and --ham switches to teach SpamAssassin about these samples.

  3. Let SpamAssassin proceed, learning as it goes. When it finds spam and non-spam, it will add the "interesting tokens" to the database, assuming auto-learning is enabled.

  4. If you need SpamAssassin to forget about specific messages, use SA-LEARN --forget. This can be applied to either spam or non-spam that has run through SpamAssassin or through the SA-LEARN processes. It's a bit of a hammer, really, lowering the weighting of the specific tokens in that message (only if that message has been processed before).

  5. You may need to edit the batch files (SA-LEARN.BAT) in C:\PERL\BIN to add the SET RES_NAMESERVERS statements referenced previously. Actually, in SpamAssassin 2.50 this won't make a difference - the DNS based scores do not affect the Bayesian learning process. In the future it may. Be forewarned.

  6. After you've built a sufficient corpus, SpamAssassin will default to "auto-learning" and keep adding to the Bayesian database from then on. Run another sample message or two that you didn't put through the initial learning process, using the -D debug switch. See?

  7. Be aware the Bayesian databases will be stored and accessed from %USERPROFILE%\.spamassassin (e.g. C:\Documents and Settings\username\.spamassassin in WinXP and Win2K). Make sure you're logged on as the user that you'll be running SpamAssassin as... or it won't find them! (Just FYI, there are configuration switches to override this.)
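If you are not sure which profile the databases ended up in, a quick check like the following can help. This is only a sketch: the file names bayes_toks and bayes_seen are an assumption (they are what a DB_File-backed store normally uses with the default bayes_path), and the helper names are made up.

```python
import ntpath
import os

# Assumed file names for a DB_File-backed Bayes store with default settings.
BAYES_FILES = ("bayes_toks", "bayes_seen")


def bayes_dir(userprofile):
    """Build the per-user .spamassassin path using Windows path rules."""
    return ntpath.join(userprofile, ".spamassassin")


def missing_bayes_files(present_names):
    """Return the expected Bayes files that are not in the given listing."""
    return [name for name in BAYES_FILES if name not in present_names]


if __name__ == "__main__":
    folder = bayes_dir(os.environ.get("USERPROFILE", ""))
    names = os.listdir(folder) if os.path.isdir(folder) else []
    print(folder, "missing:", missing_bayes_files(names))
```

Run it while logged on as the account that runs SpamAssassin; an empty "missing" list suggests the databases are where that account will look for them.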

Advanced Features & lingering issues with Win32


Proceed with caution! Geeky discussion follows!

PART XIV: AutoWhiteList

The AutoWhiteList is a neat SpamAssassin function. With the feature enabled, people who send non-spam are added to the "whitelist" - meaning they become increasingly less likely to have their mail falsely flagged as spam. Each time the address is referenced, it gets weighted a bit more in one direction or the other. It's pretty cool that SpamAssassin can "learn" as it goes along. It's been pointed out the name is a bit misleading because it can auto-blacklist just as easily.
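To make the "weighted a bit more in one direction or the other" idea concrete, here is a simplified Python model. It is not SpamAssassin's code: the class and method names are made up, and the factor of 0.5 is an assumption standing in for whatever your configuration uses. The idea is that each sender has a running average score, and a new message's score is pulled part of the way toward that average.

```python
class AutoWhitelist:
    """Toy model of the AutoWhiteList: pull each score toward the sender's mean."""

    def __init__(self, factor=0.5):
        self.factor = factor  # how strongly history overrides the current score
        self.history = {}     # sender -> (message count, total of raw scores)

    def adjust(self, sender, score):
        count, total = self.history.get(sender, (0, 0.0))
        adjusted = score
        if count > 0:
            mean = total / count
            # Move the score part of the way toward the historical mean.
            adjusted = score + (mean - score) * self.factor
        # Record the raw (unadjusted) score for next time.
        self.history[sender] = (count + 1, total + score)
        return adjusted


if __name__ == "__main__":
    awl = AutoWhitelist()
    for raw in (0.0, 0.0, 10.0):
        print(raw, "->", awl.adjust("friend@example.com", raw))
```

A sender with a clean history gets a high-scoring message pulled down, and a sender with a spammy history gets a clean-looking message pulled up, which is why "auto-blacklist" describes it just as well.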

SpamAssassin 3 features three changes to the AutoWhiteList:

  1. You turn it on/off with the use_auto_whitelist 1 (or 0 to turn off) configuration file switch. The -a command line switch no longer works.
  2. It defaults to being on, instead of off, which is a pretty big change indeed.
  3. It uses the Bayesian database format.

If you don't use autowhitelisting, you'll have to manually add whitelisted addresses to the configuration files. This is a little more tedious unfortunately, and there's no weighted learning. See the usage notes in Part X for hints on how to do this. On the other hand, being a conservative fellow, I'd advise you first get SpamAssassin up and running before adding this or any of the following features.

PART XV: SpamC/SpamD

What are SpamC and SpamD? They are a pair of programs that work together to provide SpamAssassin functionality in a distributed, cross-platform, and ultraefficient way. They support the loading of user scores via LDAP, SQL, and other mechanisms.

How does this work? SpamD ("SpamAssassin Daemon") is loaded once and only once. At that time it loads the SpamAssassin engine and ruleset into memory. It sits running in the background of a machine, listening on a specific TCP port.

SpamC ("SpamAssassin Client") - which can be on that machine or any other - contacts SpamD via TCP/IP. After some initial handshaking, it sends the message to SpamD. SpamD runs the SpamAssassin engine, and reports the results. SpamC then disconnects from SpamD and generates a local result file or score. You can think of SpamC as a lightweight version of SpamAssassin, needing to run wherever the mail server is. The memory overhead and CPU usage of SpamC itself is practically negligible.
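The exchange described above is a small text protocol over TCP. The Python sketch below shows roughly what a client sends and how it might read the verdict. Treat the details as assumptions rather than gospel: the CHECK command, the SPAMC/1.2 version string, the "Spam:" reply header and port 783 reflect the spamd protocol as I understand it, and the function names are made up.

```python
import re
import socket


def build_check_request(message: bytes) -> bytes:
    """Wrap a raw message in a CHECK request (header block, blank line, body)."""
    header = "CHECK SPAMC/1.2\r\nContent-length: %d\r\n\r\n" % len(message)
    return header.encode("ascii") + message


def parse_check_response(response: bytes):
    """Pull (is_spam, score, threshold) out of a reply containing a Spam: header."""
    text = response.decode("ascii", "replace")
    match = re.search(
        r"^Spam:\s*(True|False)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)", text, re.M
    )
    if match is None:
        raise ValueError("no Spam: header in response")
    return match.group(1) == "True", float(match.group(2)), float(match.group(3))


def check(message: bytes, host="127.0.0.1", port=783):
    """Send one message to a running spamd and return the parsed verdict."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(build_check_request(message))
        conn.shutdown(socket.SHUT_WR)  # tell spamd the request is complete
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_check_response(b"".join(chunks))
```

Note how little the client does: no rules, no Perl, no databases. That is why SpamC's overhead is practically negligible, and why SpamD can live on a different machine (or a different operating system) from the mail server.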

The advantages of such an approach versus running SpamAssassin itself serially are myriad:

Here's the big problem though: Currently SpamD doesn't run on a pure Win32 system, mostly because the fork command doesn't work well on Win32 (although there are additional minor issues with SysLog and other modules), so SpamAssassin must be run in "serial mode" in a pure Win32 environment.

Your SpamD alternatives are

SpamC used to be just as grim a scenario for Win32 administrators, but that issue is now more or less solved, and in fact an embarrassment of riches is now available. You have several choices:

PART XVI: Razor, Pyzor and DCC

Razor is kind of a neat concept - it checks your e-mail against a database of KNOWN spam mail. It computes "fuzzy hashes" (they can't be exact hashes, because spammers often put unique tracking info, etc. into the mail) to make fingerprints of these messages, and you share the results with other Razor users. The commercial branch of this product is Cloudmark, which provides plugins for Outlook and Exchange.

DCC (Distributed Checksum Clearinghouse) and Pyzor follow a similar concept in their algorithms.
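To see why a "fuzzy" hash is needed, here is a toy Python illustration. It is emphatically not the algorithm Razor, DCC or Pyzor actually use; it just shows the general idea of stripping the parts spammers vary (numbers, URLs, spacing, case) before hashing, so that two copies of the same spam produce the same fingerprint. The function name and the sample messages are made up.

```python
import hashlib
import re


def fuzzy_fingerprint(body: str) -> str:
    """Hash a message body after removing the bits that vary per recipient."""
    text = body.lower()
    text = re.sub(r"https?://\S+", "URL", text)  # URLs often carry tracking IDs
    text = re.sub(r"\d+", "", text)              # drop serial numbers and codes
    text = re.sub(r"\s+", " ", text).strip()     # ignore spacing and line breaks
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = "Buy NOW! Your code: 12345\nhttp://example.com/t?id=abc"
    second = "buy now!  your code: 99887 http://example.com/t?id=xyz"
    print(fuzzy_fingerprint(first) == fuzzy_fingerprint(second))
```

An exact hash of the two sample messages would differ; the normalized one does not, which is what lets a shared database recognize the same spam run across many recipients.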

SpamAssassin can use these modules, and add their results to all the tests it performs - at least on non-Win32 platforms.

Note: All of these build perfectly without effort on Cygwin, should you choose to use that.

I've stopped efforts to get these working. Razor and DCC are no longer open source, they are a pain to get working, and I think they are overrated. Pyzor is OK, but requires Python installed and a fair amount of effort.


PART XVII: Plugins

SpamAssassin 3 introduced the concept of plugins. This means the developers can concentrate on core code, and many other pieces of functionality can be spun off as independent code plugins.

No plugins are loaded by "default", per se. However, SpamAssassin ships with two files in the User Rules Directory, INIT.PRE and V310.PRE, that enable several. V310.PRE, as the name implies, lists plugins added in version 3.10.

Open one of these in a text editor. In addition to comments, you'll see some lines that begin with loadplugin (pluginname). Some are commented out, in which case they are inactive; some are not. You enable/disable them by removing/adding the # comment in front of the loadplugin statement.
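If you'd rather not eyeball the files, a few lines of Python can list which loadplugin lines are active. This is a rough sketch with a made-up function name; the sample text only imitates the look of a .PRE file and is not copied from the real INIT.PRE.

```python
def plugin_states(pre_text):
    """Map each plugin named on a loadplugin line to True (active) or False."""
    states = {}
    for raw in pre_text.splitlines():
        line = raw.strip()
        enabled = not line.startswith("#")   # a leading # disables the line
        parts = line.lstrip("#").split()
        if len(parts) >= 2 and parts[0] == "loadplugin":
            states[parts[1]] = enabled
    return states


SAMPLE = """\
# DCC - perform DCC message checks.
#loadplugin Mail::SpamAssassin::Plugin::DCC

# URIDNSBL - look up URLs found in the message against several DNS blocklists.
loadplugin Mail::SpamAssassin::Plugin::URIDNSBL
"""

if __name__ == "__main__":
    for name, enabled in plugin_states(SAMPLE).items():
        print("on " if enabled else "off", name)
```

To check your own installation, read INIT.PRE or V310.PRE into a string and pass it to plugin_states() instead of SAMPLE.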

Here's a commentary on them. This is my opinion, YMMV.

Plugin | On by Default?/Recommended? | Comments
RelayCountry | No/No | A couple of rules use this to determine if a message has been relayed through multiple IP addresses that belong to different country blocks, a possible spam sign. It might be worth enabling this. If you do, you need the IP-Country Perl module installed.
HashCash | Yes/No | An interesting if more or less worthless idea, HashCash says "why not charge for e-mail, so that bulk e-mail gets too expensive". This calculates a special hash for these e-mails. No one uses it, so unless you find it useful for internal purposes, forget about it.
URIDNSBL | Yes/Yes | Also known as SURBL, this is the one must-have plugin listed in INIT.PRE. You should have this on; it makes a big difference. However, it does generate MANY DNS queries, which is potentially an issue in some cases.
SPF | Yes/No | SPF is more or less worthless, but people keep believing it is useful. It is worthless for antispam, but potentially good for anti-phishing. If you leave this on, you need the Mail-SPF-Query Perl module installed.
DCC | No/No | DCC was once a great service, albeit difficult to install on Win32. It used to be a core SpamAssassin function; it is now a plugin and disabled by default since it is no longer open source. Leave it off.
RAZOR2 | No/No | Razor was once an OK service, albeit nearly impossible to install on Win32. It used to be a core SpamAssassin function; it is now a plugin and disabled by default since it is no longer open source. Leave it off.
PYZOR | Yes/No | Pyzor is open source and works OK, but requires Python, is difficult to install on Win32 and works poorly. Turn it off.
AntiVirus | No/Maybe | A very, very basic AntiVirus - this simply blocks any message containing executable content, regardless of extensions. If you really have no AV product (what?!?!?) and you want to block the spam generated, and don't care much about this, sure, turn it on.
AWL | Yes/Yes | The AutoWhiteList functionality is here. See Part XIV for more, but generally you want this on.
TextCat | No/Maybe | The language detection (ok_languages) is now here instead of in the core. This CAN be useful for blocking mail in languages you don't use, but it is not 100% accurate and is very, very slow. Leave off unless you want to try this!
AccessDB | No/No | This is very Unix oriented. It checks against special access databases used by products such as Postfix and Sendmail. It could be useful there for whitelisting/blacklisting. Probably not on Win32 though.
WhiteListSubject | Yes/Yes | Lets you write whitelist_subject (pattern) rules against the subject. Of course, since this is easily forged, it's questionable how safe it is. But you could have a gentleman's agreement with your clients that if they use, say, PASSMESAPLEASE in the subject, that would subtract 10 points.
DomainKeys | No/No | DomainKeys, or DKIM, is a new standard proposed by Cisco/Yahoo. It fixes many of SPF's faults, but the fact is it is unused and is a better anti-phish than anti-spam tool.
MimeHeader | Yes/Yes | Lets you create rules against arbitrary MIME headers (instead of a limited list like the regular header rules). May be increasingly useful in future.
ReplaceTags | Yes/Yes | This is an optimizer for RegEx, already heavily used.
SpamCop | Yes/No | This is the automatic reporting service for SpamCop. Needs Net-SMTP to work, and I think it's stupid.


Note that if you use switches like use_razor2 in your local.cf, they will FAIL and generate error messages if the corresponding plugin isn't loaded. Hence use your pre-3.10 rules with care!
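Before reusing an old local.cf, you could scan it for switches whose plugin is not loaded. The Python sketch below shows one way to do that. The switch-to-plugin table is an assumption and covers only the three switches discussed in this HOWTO, and the function name is made up.

```python
# Assumed mapping from local.cf switches to the plugin that implements them.
SWITCH_TO_PLUGIN = {
    "use_razor2": "Razor2",
    "use_dcc": "DCC",
    "use_pyzor": "Pyzor",
}


def orphaned_switches(local_cf_text, loaded_plugins):
    """Return switches in local.cf whose plugin is not in loaded_plugins."""
    orphans = []
    for raw in local_cf_text.splitlines():
        words = raw.split("#", 1)[0].split()  # ignore comments and blank lines
        if not words:
            continue
        plugin = SWITCH_TO_PLUGIN.get(words[0])
        if plugin is not None and plugin not in loaded_plugins:
            orphans.append(words[0])
    return orphans


if __name__ == "__main__":
    old_cf = "report_safe 0\nuse_razor2 0\nuse_dcc 0\n"
    # Pretend only the DCC plugin is enabled in INIT.PRE / V310.PRE.
    print(orphaned_switches(old_cf, {"DCC"}))
```

Anything the function reports should either be removed from local.cf or have its loadplugin line enabled.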

Integrating SpamAssassin with various Windows Apps


A short discussion of ways to integrate SpamAssassin with various e-mail clients.

PART XVIII: A POP3 Proxy for SpamAssassin

These are cool because they can support things like Eudora, Outlook Express, Netscape, etc.

Probably the one with the most promise was SAProxy. It was an updated version of POP3Proxy (see below), with a GUI and goodies. It has now been sold to Yahoo and is no longer available.

Another one is JSpamAssassin, a Java based POP3 proxy server. It's at http://sourceforge.net/projects/jspamassassin/

A POP3 proxy for SpamAssassin is at Nick Fisher's site. It's in a somewhat unfinished state, but could serve as the basis for a more polished script: http://www.nickdafish.com/SAPP.htm.

An older version of SAProxy, known as POP3Proxy, can be found here: http://mcd.perlmonk.org/pop3proxy/.

Finally, a commercial POP3 proxy using SpamAssassin which integrates with Eudora is here: http://www.spamnix.com.

There's actually been a spate of these POP3 proxies recently, so unless your product is particularly notable, I probably won't list it. I suggest using Google or other such search engines for more.

PART XIX: Integrating SpamAssassin for Win32 With Outlook

A commercial version of SpamAssassin written for Outlook was available from Deersoft. However, Deersoft was purchased by NAI, which has marketed its own variants of SpamAssassin to the marketplace.

A free add-in for Outlook XP is at http://sourceforge.net/projects/saoutlook/. However, development appears to have stalled.

By permission of the author of the plugin, Jason Reusch, I've reproduced a letter from the SA mailing list describing some of the theory behind this (note that he posted this very early on in his development, and the specific code has no doubt changed a goodly amount):

PART XX: Integrating with Novell GroupWise

Please have a look at my program, which provides full SA integration with the build you've just created.

Some specific notes for Guinevere users that are upgrading to 3.1:

First of all, one time saver: you no longer need to (nor should you) apply the Perl script modifications referenced in HandJ.Doc.

However, SpamAssassin 3.1 has significantly broken compatibility with GroupWise. You will need to make many changes to the SA code to fix this.

If you don't, your messages will get munged and GWIA will be unhappy. The problem is GWIA doesn't generate/use pure MIME. It puts an SMTP router envelope in front.

See this document for more information BEFORE PROCEEDING!

Local.CF Recommendations

There are other changes and settings. Let's start by suggesting that your local.cf might contain:

report_safe 0
use_auto_whitelist 0
rewrite_header Subject ***Spam***

.... as well as any personal settings...

  1. AutoWhiteListing is now on by default and must be turned off with the above switch. If you want it on, remove that switch or change the value to 1. In the Guinevere 2 program there's an option to "Enable AutoWhitelisting". With SpamAssassin 3, this checkbox must always be OFF, as it corresponds to running SpamAssassin with the -a command line switch - which, as of SpamAssassin 3, no longer exists. Hence this checkbox was removed in Guinevere 3, which shipped with SpamAssassin 3.
  2. If you have report_safe 1 in local.cf (and this is the default if report_safe is not in this file), the main text becomes the markup report and the original message is embedded as a forwarded message. This will mess up resubmits and pass-throughs a la the Heath And Jame mod.
  3. rewrite_subject was on by default (this inserted ***SPAM*** into the subject). It's now off by default, and the switch name and usage have changed, so we need to add this appropriately to local.cf. Note SACONF does this for you in a friendly GUI way.
  4. DCC, Pyzor and Razor remain unsupported, as before, and are best forcibly switched off. See the PLUGINS discussion. If you have stuff like use_razor2 0, use_dcc 0, use_pyzor 0, REMOVE it, and instead edit the INIT.PRE and V310.PRE files.
  5. The latest Winspamc, with Guinevere-specific instructions not available elsewhere, is in the TOOLS subdirectory as of Guinevere 2.0.14. You may use this spamc or the SA SpamC (requires Guinevere 2.0.17 or Guinevere 3).

PART XXI: Microsoft Exchange

PART XXII: Integrating with Lotus Notes

Kim Dowds has provided this neato option, though the project seems dead.

Another user commented that

"I also came up with an Agent that will then run the email. Some of the code was from the posting from the projectLounge site, and some was from my own. I hope this information will be useful to other Lotus Notes users as well."