NOTE: This is a static archive of an old blog; interactive features such as search and categories no longer work.

Since I put up this site I have been paying attention to my log files to see how it gets accessed. One of my main motivations for putting up a personal site was not to publish content or personal ideas but to study the blogging world: how it communicates and how information flows.
Obviously RSS [1, 2] and other XML technologies are the underlying technologies that enable interesting services such as Technorati, Feedster, Blogosphere, Geoblog, Blogshares and many more, so studying them is essential. I have been looking for the RSS book for a while and might have to resort to ordering it from Amazon.
There is, however, a lot more to a website than an XML file. The net is constantly being trawled by unwelcome guests; these range from email address harvesters and services that “monitor” your server to badly behaved search engine crawlers and bad actors like the RIAA.
Here I present some strategies for dealing with these, from simply asking the well-behaved ones to go away with a robots.txt file to forcing the bad ones away with mod_rewrite and similar methods.


First off, you need to be able to make sense of a web server log by looking at it. You want to be using the combined log format, or even a custom one that records more information. The most important fields are the remote IP address/host, connection status, request protocol, time, first line of the request, referer and user agent.
Of these, the most useful tell you what a client is trying to see (first line of the request, request protocol), who they are (host/IP), what software they are using (user agent, which is often spoofed) and where they were before (referer).
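For reference, this is roughly how the combined format is defined in Apache; the log file path is only an example and will differ on your system:

# Combined log format: host, identity, user, time, request line, status, bytes, referer and user agent
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
# Write the access log using that format
CustomLog /var/log/apache/access_log combined
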
Once you have this going and can understand it, you will notice a long list of user agents; this is a good indication of what is accessing your site. In some cases it will be obvious, as with search engines: their referer or user agent fields usually include a URL where you can find out more. Google, for example, uses “Googlebot/2.1 (+http://www.googlebot.com/bot.html)” as its user agent. Others are web browsers; a standard Internet Explorer reports something like “Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)”.
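To see where these fields sit, here is a made-up combined-format log line; the IP address, timestamp and paths are purely illustrative, and the last two quoted fields are the referer and the user agent:

10.0.0.1 - - [12/Oct/2003:14:32:10 +0200] "GET /index.html HTTP/1.1" 200 4523 "http://www.example.com/links.html" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
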
I found a really useful resource that has a database of user agents; they list 492 at the moment. They are conveniently categorised as search engines, offline browsers, validators and email collectors, and the site can even create configuration entries for robots.txt or mod_rewrite, but more on this a bit later.
Now for combating unwanted traffic. Unwanted traffic can be a search engine that you do not like, or one that simply trawls your site too often; it can be people mirroring your site with software that makes many parallel requests; it could be people you do not like (the RIAA comes to mind); and finally it can be people harvesting email addresses for spam lists.
The good spiders or robots will honour the Robots Exclusion Protocol. This protocol allows you to tell a bot to access only certain parts of your site, or none at all. It takes two forms: a meta tag included in your HTML, or a file called robots.txt in the root of your server (an example of the meta tag form follows the robots.txt sample below). The file is pretty simple and controls bots based on their user agent. The sample file below will block the Inktomi/HotBot search engine (user agent “Slurp”) from seeing any pages on your server.

User-agent: Slurp
Disallow: /
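
The meta tag form mentioned above goes in the <head> of each page you want to protect; this variant asks well-behaved robots not to index the page or follow its links:

<meta name="robots" content="noindex,nofollow">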

Be sure to read http://www.robotstxt.org/ for more information, and most importantly take a look at their database of robots, where you will find entries for common crawlers like the Googlebot; all reputable search engines are registered there. I also found a good tutorial about robots.txt here.
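As a less drastic example, robots.txt can also keep a specific crawler out of just part of a site while leaving the rest open; the directory names here are only placeholders:

# Keep Googlebot out of two directories, leave the rest open
User-agent: Googlebot
Disallow: /private/
Disallow: /drafts/

# All other robots may crawl everything
User-agent: *
Disallow:
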
The above approach is effective at controlling the good guys but is unfortunately of no use against email harvesters and other such things. For combating these you need to get tough. A few Apache modules are useful here, most importantly mod_rewrite.
Using mod_rewrite is not trivial, and I suggest you play somewhere other than a live server before going forward with this. I also suggest starting small, with one site or possibly even a subset of a site. Doing this will also slow things down slightly, so if you have a server that is under high load this may not be for you.
The basic concept here is to use RewriteCond directives to pick up on browsers (user agents), remote hosts, users, access methods or even URIs, optionally setting an environment variable that classifies them, and then use a RewriteRule to either send them to a page with a nice error message or to simply return an error such as a 403. You can put this in the main web server configuration file or in the .htaccess file for certain directories.
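As a sketch of the “nice error page” variant: the rules below rewrite requests from two well-known email harvesters to a static page. The agent names are only examples, and /go-away.html is a hypothetical page you would have to create yourself:

RewriteEngine on
RewriteBase /
# Do not rewrite the error page itself, or the rule would loop
RewriteCond %{REQUEST_URI} !^/go-away\.html$
# Match a couple of harvester user agents (examples only)
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC]
# Serve them the explanation page instead of the real content
RewriteRule .* /go-away.html [L]
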
I will use .htaccess files in my examples, which means I can also use the .htaccess “deny from” and “allow from” lines to block hosts; this should be a bit faster than using mod_rewrite for denying specific hosts.
A simple .htaccess file that blocks the same bot as above looks like this; it returns a 403 error:

# Allow all hosts by default
Order allow,deny
allow from all
RewriteEngine on
RewriteBase /
# Return a 403 to any user agent starting with "slurp" (case insensitive)
RewriteCond %{HTTP_USER_AGENT} ^slurp [NC]
RewriteRule .* - [F,L]

To take this further you can deny IP addresses from places you do not like. You can use simple “deny from” entries in the .htaccess for specific IP addresses, or, for something more flexible, mod_rewrite is useful again since it supports regular expressions. In this example I deny some RIAA IP addresses and a spybot.

# Block specific hosts and networks outright
Order allow,deny
deny from 80.88.129.28
deny from 211.157.36.7
deny from riaa.com
deny from mpaa.com
allow from all
RewriteEngine on
RewriteBase /
# NameProtect spybot (12.148.209.192-255) or the Slurp user agent gets a 403
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^slurp [NC]
RewriteRule .* - [F,L]
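
The environment-variable classification mentioned earlier can also be done without mod_rewrite at all, using mod_setenvif together with “deny from env=”. A minimal sketch, again with the harvester user agents only as examples:

# Tag requests from known bad user agents with the "bad_bot" variable
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
# Deny anything carrying that tag, allow everyone else
Order allow,deny
allow from all
deny from env=bad_bot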

I hope this is of some help. I will follow up later with details of log analysis tools that can show you stats and more.