
Save Your Site from Spambots

Techniques to Prevent Address Scraping

By Steven Champeon

The problem: too much spam. Unsolicited advertising email continues to account for untold business losses each year. To give you an idea of the scope of the problem, in 1998 AOL reported that of the approximately 30 million email messages its servers handled each day, between 5 and 30 percent were spam. Assuming that this rate is true for other email providers as well, spam takes a significant economic toll on business, not merely in terms of Internet resources, but in lost employee productivity as well.

Sometimes, whether you receive bulk email is just the luck of the draw. Target addresses are often generated at random, or constructed from common usernames and domains. My own mail server is configured to forward any mail sent to my domain, regardless of address, straight to my account. Among the legitimate mail, I notice lots of spam for variations on hesketh.net (for example, ed@hesketh.net), even though there are very few real email addresses in that domain (which is just the Web hosting arm of my business).

There are many other ways in which real email addresses commonly fall into the hands of spammers. Any publicly available source of email addresses can be considered fuel for their activities. Usenet newsgroups and mailing lists have long been gold mines for spammers, who happily steal return addresses from posts.

One of the most popular sources of addresses for bulk mailings, however, is the Web. Software packages, known informally as "spambots," spider the Web collecting information in much the same way that search engines do. The difference is that spambots have but one purpose: to "scrape," or harvest, every email address they find on the pages they analyze, and add them to bulk email lists.

Email addresses might be harvested from posts on public Web forums or message boards. Or, worse—they could be gathered from your own corporate Web site. Fortunately, if you're in charge of maintaining your company's Web servers, there are steps you can take to prevent this from happening.

Apache to the Rescue

Apache—based on the old NCSA httpd—is the world's most popular Web server. According to the current Netcraft Survey, Apache runs on more than 62 percent of the world's Web servers. With its mod_rewrite module, Apache presents an effective means of blocking spambots from harvesting your site's addresses.

To build Apache with support for mod_rewrite from scratch, download the latest source distribution for your system from an appropriate mirror of apache.org. The file install.sh, available online, includes all of the command line options you'll need for most Unix systems. For other operating systems, see the relevant documentation on the Apache site, or read the INSTALL documentation that comes with Apache.
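If you don't have install.sh handy, the build boils down to a few commands. The following is a minimal sketch for an Apache 1.3 source tree; the version number, install prefix, and options are placeholders, so adjust them to match your environment and the options in install.sh.

# Minimal build sketch (assumes an Apache 1.3 source tree; adjust to taste)
tar xzf apache_1.3.xx.tar.gz
cd apache_1.3.xx
./configure --prefix=/usr/local/apache --enable-module=rewrite
make
make install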

If you're already running Apache, simply key in the following command (substituting the appropriate path to your existing Apache binary) to check whether your server installation already supports mod_rewrite:

/usr/local/apache/bin/httpd -l

It will either show you that you have support for Apache's runtime shared objects, where modules are compiled and then loaded as needed, or else list the modules that were linked during a static build. Examples of the different types of output you can expect are shown in modules.txt, online. If the output of this command includes mod_rewrite.c, then your Apache installation has what you need. Congratulations!
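If you can't get at modules.txt, the output of a statically built server looks roughly like the following. Treat this as an illustration rather than a reference; the exact list depends entirely on how your binary was compiled.

Compiled-in modules:
  http_core.c
  mod_env.c
  mod_log_config.c
  mod_mime.c
  mod_alias.c
  mod_rewrite.c
  ...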

Getting to Know mod_rewrite

Because it works in seemingly mysterious and powerful ways, mod_rewrite has sometimes been described as voodoo. In a nutshell, the mod_rewrite module lets you perform customized URL rewriting deep in the guts of the Apache process, based on any of the properties associated with an incoming request.

In plain English, this means that you can check any property (for instance, the User-Agent: header, Referer: header, the URL of the request, and many others) and perform certain actions based on the value of that property. For our purpose, we rely on the fact that many popular spambot packages are actually dumb enough to announce themselves as such.

I won't go into the gory details of mod_rewrite, as that would take far more room than I have in this article. But I will give you an overview of how mod_rewrite works so that we can check a User-Agent string and redirect the spambot to a page that lets it know we don't allow scraping on our site.

First, we need to enable the mod_rewrite engine. This is done by including a simple set of directives in your httpd.conf file, as shown in Example 1. The RewriteEngine directive is set to "on," enabling mod_rewrite. The RewriteLog directive turns on logging; in this case, output is written to /var/log/mod_rewrite.log, though you may wish to put it somewhere else, such as /usr/local/apache/logs/rewrite.log. Finally, the RewriteLogLevel directive sets the logging level to zero, or silent.

You may wish to increase the logging level a bit during testing, to ensure that you're only catching the files you wish to block and that the redirects are happening appropriately. A setting of nine gives you the most output (far too much output for most cases) and a level of three gives acceptable output for debugging purposes. Once you're done debugging, feel free to set it back to zero or anything below three, depending on your needs. Restart Apache using the apachectl restart command after you change your configuration settings, to make sure they take effect.

With a RewriteLogLevel setting of three or higher, you can supervise the mod_rewrite engine while it's running. Just run tail -f /path/to/log. To test whether things are working properly, telnet to the server's HTTP port (usually 80) and request a page while supplying a User-Agent string that matches the various spamware agents in the rewrite rules discussed below. See the file session.txt (available online) to view the output of a test that worked. The output you can expect from the Apache logs for a successful redirect at a RewriteLogLevel of three is in log.txt.
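The file session.txt isn't reproduced here, but a successful test looks roughly like the following; the hostname, IP address, and redirect target are placeholders, and the blank line after the User-Agent: header is what ends the request. The 302 status is mod_rewrite's default response for the [R] flag.

$ telnet www.example.com 80
Trying 192.0.2.1...
Connected to www.example.com.
Escape character is '^]'.
GET /index.html HTTP/1.0
User-Agent: EmailSiphon

HTTP/1.1 302 Found
Server: Apache
Location: http://www.example.com/nospam.html
Connection: close
Content-Type: text/html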

Finally, our configuration loads the mod_rewrite rulesets by way of Apache's Include directive. All of the directives associated with mod_rewrite are wrapped in a conditional IfModule block, which makes sure that mod_rewrite is operational before trying to read them.
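Example 1 is available online; reconstructed from the description above, it looks something like this. The log path and include path are placeholders, so adjust them to suit your layout.

<IfModule mod_rewrite.c>
  # Turn on the rewrite engine and (silent) logging
  RewriteEngine on
  RewriteLog /var/log/mod_rewrite.log
  RewriteLogLevel 0
  # Pull in the spambot-blocking ruleset (Listing 1)
  Include conf/nospam.conf
</IfModule>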

Laying Down the Law

The ruleset itself, shown in Listing 1, includes several conditionals (RewriteCond), each of which may take several arguments. Server variables are referenced with the %{SERVER_VAR} construct.

In the first conditional, we make sure that we're only checking requests for HTML files. These are the files the spambots will be searching through for email addresses to scrape. The %{REQUEST_FILENAME} server variable contains the resolved path to the requested file. We check it against the pattern html?$, which covers .html, .htm, .xhtml, and anything else that ends in htm or html. The ? means that the l is optional, and the $ binds the match to the end of the string.

Once we've determined that the request in question is an HTML file, we then compare the contents of the User-Agent: HTTP header with a list of known spambot signatures. For example,

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]

checks for any User-Agent that begins with EmailSiphon, the name of a common spambot.

The caret (^) binds the match to the beginning of the User-Agent string. The [OR] at the end is a mod_rewrite operator that lets other conditionals follow. At the end of the list of User-Agents for which we're checking, there is no [OR]; this terminates the compound conditional. In English, it reads something like this: "If the user agent is requesting an HTML file, and it identifies itself as EmailSiphon or EmailWolf (or other spambots), then do the next thing."

The "next thing," as you might have figured out, is a redirect to a page containing information about why the requested page wasn't delivered, who to contact for more information, and so forth. On our page, we also include a mailto: link to our abuse address, abuse@hesketh.net, for those spammers who are dumb enough to announce themselves to the folks most unlikely to enjoy being spammed.

The redirect is expressed using the RewriteRule directive, which simply redirects all matching requests (^.*$) to the URL in the next argument. The [R] operator tells mod_rewrite to redirect the visitor to the page. Another option is to use a "pass through" or [PT] operator instead of issuing an HTTP redirect. This is most useful for situations in which your configuration involves many Aliases and the like, as it simply rewrites the guts of the request record so that subsequent modules (such as mod_alias) can do the right thing.
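Listing 1 itself is available online. A cut-down sketch of the same pattern, with only two User-Agent signatures and a placeholder URL for the explanatory page, might look like this:

# Only bother with requests for HTML files
RewriteCond %{REQUEST_FILENAME} html?$
# Known spambot signatures; add further [OR] lines as needed
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
# Redirect anything that matches to the explanatory page
RewriteRule ^.*$ http://www.example.com/nospam.html [R]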

Insert the patterns from Listing 1 into a file (in my example, I've called it nospam.conf) and then load it using Apache's Include directive, as discussed earlier. This lets multiple servers and virtual hosts on the same machine use the same mod_rewrite rules. It also lets you update the patterns whenever you need to block new spamware with new User-Agent signatures. Note that, depending on your setup, you may need to include the <IfModule mod_rewrite.c> block in the configuration for each virtual host. Finally, restart the server.

Voila! You've successfully protected your Web server from the most egregious spambots, making it possible to post your users' email addresses on your Web site while preventing undesirable elements from stealing them for nefarious purposes.

A final caution: some spambots masquerade as well-known browser software, rather than announcing their own IDs. That means the technique above won't block every attempt to scrape addresses. However, it will protect you from, if you'll pardon the pun, the bulk of the spammers out there.

Other Solutions

What else can you do to keep employees' addresses out of spammers' hands? One school of thought suggests that you severely curtail network activities. For example, employees shouldn't post to Usenet, or if they post, they should use bogus email addresses; they shouldn't participate in publicly archived mailing lists; they shouldn't post their email address on any Web site; and so on. (For related information, see "Online Resources".)

I have a problem with this approach: it means that spammers have won. Making it difficult for people to contact your business out of fear that your users might get spammed is a losing proposition.

One solution is to use JavaScript to print any mailto: links and other occurrences of your address, as seen in Listing 2. To print a mailto: link or your email address, simply insert the HTML shown in Listing 3 into the document where you want the link or address to show up. Beware that this won't work in browsers that don't understand JavaScript, or in browsers that have JavaScript disabled.
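Listings 2 and 3 are available online; the general idea, sketched below with hypothetical function and parameter names, is to assemble the address from pieces at display time so that the complete string never appears intact in the page source. The first snippet is the sort of helper Listing 2 would define, and the second shows how you'd drop it into a page, as Listing 3 does.

<script type="text/javascript">
// Hypothetical helper: build the address from parts so the complete
// string never appears in the HTML source.
function writeMailto(user, domain, label) {
  var addr = user + "@" + domain;
  document.write('<a href="mailto:' + addr + '">' + label + '<\/a>');
}
</script>

<!-- Where the link should appear in the document: -->
<script type="text/javascript">
writeMailto("schampeo", "hesketh.com", "Send me email!");
</script>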

Another option is to use HTML entities to encode mailto: links and other mentions of your address so that extremely brain-dead spamware can't scrape it, like so:

<a href="mailto&#58;schampeo&#64;hesketh&#46;com">
Send me email!
</a>

To do the same thing with your address, simply replace the @ with the HTML entity for that character, &#64;. Then scatter other entities throughout the address, for example, using &#46; for the "." in your domain name. Web browsers will translate the entities into the characters they represent, but spamware is unlikely to understand the encoding. In the future, however, as spammers and their software get smarter, tactics like these may prove to be relatively poor solutions.

Some mail servers also allow "plussed" addresses, which can be used to track who is sending spam. For example, if I fill out an untrusted Web form at example.com, I might add that domain to my address, like so: schampeo+example.com@hesketh.com. Then, if I do get spammed, I'll know who did it. Check with your mail server vendor to see if your software can accommodate this practice.

Make sure that your users don't reply to spam. Asking to be removed from junk mail lists only confirms that a given address is valid.

There are several other approaches to preventing addresses from being harvested, including giving out fake addresses and using obfuscated or invalid HTML on Web pages (see Example 2). I don't recommend these, however, because in using them, you're just giving in to spammers, while making it more difficult for people to send you legitimate email. These approaches can also cause problems for innocent people and systems administrators who have to clean up the mess.

One tactic I do recommend is the use of spamtraps—addresses that you control, but that have no other use besides catching spammers. I have several unpublished freemail accounts that receive nothing but spam, which I then report to the appropriate authorities.

Indeed, in my view this is the best way to combat unwanted bulk email. Mail administrators should make it a policy to immediately report spam to the ISP from which it originates. Many ISPs enforce an Acceptable Use Policy (AUP) that explicitly forbids bulk mailing. Report abuse as soon as it happens and as many times as necessary until either more ISPs wise up and start policing their customers, or until the cost of spamming becomes so high that it loses its appeal.

(Get the source code for this article here.)


Steven is CTO of hesketh.com/inc. in Raleigh, NC, but this doesn't free him from the awesome responsibility of managing the popular Webdesign-L mailing list. Reach him at schampeo@hesketh.com.



