Sigh. Here I am at work on Tuesday morning, and my list of jobs is being interrupted by our web server triggering overload alarms. Actually, it's been doing it for quite a while, but I've never sat down to analyse the logs and find out what's triggering the alarm (our gandi.net virtual server is more than powerful enough to cope, so fault finding has been low on my to-do list). This morning, as I walked to work, I saw an overload message arrive in my email. The sun is up, the sky is blue, it's 8am. It feels like a good day to fault find...
It didn't take long to find the problem. I used grep to pull today's entries out of the Apache log and put them into a temporary file:
me\@server4:/path_to_logs/rkbb.co.uk\$ grep '06/Apr/2010' apache-log > check.txt
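With the day's entries in one file, a quick tally of requests per user agent is one way to see who is making the noise. This is only a sketch: it assumes the log is in Apache's combined format, where the user agent is the sixth field when a line is split on double quotes, and the function name count_user_agents is made up for illustration.

```shell
# Tally requests per user agent, busiest first.
# Assumption: Apache "combined" log format, so splitting each line on
# double quotes puts the user agent in field 6.
# $1 = log file to read.
count_user_agents() {
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn
}
```

Running `count_user_agents check.txt | head -n 10` would then list the ten busiest user agents in the extract, each with its request count.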
The bot causing the problem has a user agent of "Mozilla/5.0 (compatible; Purebot/1.1; +http://www.puritysearch.net/)". Going to puritysearch.net, I found a 'search engine' that doesn't appear to do anything but display adverts disguised as search results.
So, how to stop this bot? Nice bots read a file called robots.txt, which tells them where they're allowed to go. Purebot didn't read robots.txt, so I couldn't exclude it there.
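One way to check whether a bot ever asks for robots.txt is to count its requests for that file in the log. A rough sketch, assuming request lines appear in the log as "GET /robots.txt ..." or "HEAD /robots.txt ..." (the function name robots_hits is hypothetical):

```shell
# Count how many times a given user agent requested /robots.txt.
# $1 = log file, $2 = user-agent substring to look for (e.g. Purebot).
# Assumption: the request line is logged as "GET /robots.txt ..." or
# "HEAD /robots.txt ...".
robots_hits() {
    awk -v ua="$2" '
        index($0, ua) && /"(GET|HEAD) \/robots\.txt/ { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, `robots_hits check.txt Purebot` printing 0 would mean the bot never fetched the file in that day's entries.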
My next thought was to use Apache to exclude the user agent. After an hour or so of trying I gave up on that (it is possible, I just didn't figure it out, so I took the easy-for-me approach). The site is running ColdFusion (actually BlueDragon), so in Application.cfm I can check the user agent and stop processing requests from Purebot there:
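\<cfset useragenttest = find("Purebot", cgi.http_user_agent)>
\<cfif useragenttest GT 0>
\<cfabort> \<!--- assumed completion of the truncated snippet: stop processing this request --->
\</cfif>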
The code isn't my most elegant, but it works. Next time I come across a bad bot (or Purebot changes its name) I'll just update this piece of code to ignore their requests.