Majestic12 – a distributed search engine

Majestic-12 is a distributed search engine that I came across on Christmas Eve. Perhaps I should rephrase that. It came across me [my web server] on Christmas Eve, with such force that the server overloaded and stopped.

Ho hum. One of the joys of running a web server is that at 5am on any given day you can receive a text message saying “Server Down”. Within the laws of science (Murphy’s Law), there is a rule that the probability of this text message arriving is increased whenever I am a) asleep or b) on holiday, and if both a) and b) apply then the chances are quadrupled.

So, what’s a “distributed search engine”? Sort of like Google. Many people may think Google is one or two super big computers. In fact, it’s lots (and lots) of simple standard computers connected together. Rather than wait for one computer to answer your search query, lots of computers look for you, each taking charge of a little piece of the database Google has built up. Likewise, to fill the Google database with information about all the pages on the internet, lots of computers go off and visit all the web sites. In summary, the work is “distributed” between many computers.
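To make the idea of “each computer takes charge of a little piece” a bit more concrete, here is a toy sketch in Python. It is only an illustration of splitting a search across several pieces of an index and merging the answers, not how Google or Majestic12 actually work. The data, the page names and the functions `search_shard` and `distributed_search` are all made up for the example, and threads stand in for separate computers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: each dictionary stands in for the small piece
# of the index held by one computer (word -> pages containing it).
SHARDS = [
    {"search": ["a.example/1"], "engine": ["a.example/1", "a.example/2"]},
    {"search": ["b.example/9"], "robot": ["b.example/3"]},
    {"engine": ["c.example/5"]},
]


def search_shard(shard, word):
    """Look the word up in one piece of the index."""
    return shard.get(word, [])


def distributed_search(word, shards=SHARDS):
    """Ask every piece at the same time, then merge the partial answers."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, word), shards)
    merged = []
    for pages in partial_results:
        merged.extend(pages)
    return sorted(merged)


if __name__ == "__main__":
    print(distributed_search("engine"))
```

No single piece knows every page that mentions “engine”; the full answer only appears once the partial results are merged.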

Now, all those computers belong to Google, but for many years teams and projects have been working on sharing the unused resources of other people’s computers in order to achieve their goals. So, you could sign up for SETI, and while you are not at your computer it will try to find alien signals in data recorded by radio telescopes. Other projects like Rosetta use your computer’s idle time to help find cures for diseases, and the ClimatePrediction.net project uses your idle computer time to crunch numbers to help forecast the climate more accurately.

Now enter Majestic12, a distributed computing solution to searching the internet. It’s in its infancy at the moment, but it uses people’s idle computers and internet bandwidth to capture information on web pages, and uses that information to respond to people’s search requests.

Now that you know what it is, what happened to my web server?
Well, when software like Google’s or Majestic12’s visits a web site, it is called a “robot”, and it should follow the instructions on my web server in a file called “robots.txt”. This file basically tells the robot where it can and cannot go. Why should robots follow it? Well, if they follow links into the shopping basket they won’t find any useful information. Going there wastes their bandwidth and mine, not to mention costing me money. Majestic12’s robot had a problem that no-one knew about. If the robots.txt had a particular value in it, the robot would ignore the whole robots.txt file. That’s a bug. It ignored my robots.txt and proceeded into the shopping basket, where it promptly got stuck in a loop. While in that loop, it kept my web server very busy trying to answer its requests (to add another item to the basket), and after a short while the server stopped answering requests from anyone.
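For the curious, here is roughly what a well-behaved robot does before fetching a page. This is a minimal sketch using Python’s standard `urllib.robotparser` module. It is not Majestic12’s code, and the robots.txt contents, the `/basket/` path, the `shop.example` address and the robot name `ExampleBot` are all invented for the example (my real file is not shown here).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every robot is asked to stay out of the basket.
ROBOTS_TXT = """\
User-agent: *
Disallow: /basket/
"""


def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="ExampleBot"):
    """Return True if the robot is permitted to fetch this URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    for url in (
        "http://shop.example/products/teapot",
        "http://shop.example/basket/add?item=1",
    ):
        print(url, "allowed" if allowed(parser, url) else "blocked")
```

A robot that skips this check, or gives up on the whole file when it hits a line it does not understand, ends up exactly where mine did: in the basket.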

The simple fix is to restart the server, but I also had a look to find out what had caused the server to stop, and saw the log file entry for Majestic-12. I visited their website, saw a user forum and posted a message in ‘bugs’ to say that the robot had stopped my server working. To be honest, I didn’t expect a reply that day, or even until the new year. But the main person on the project replied in 4 hours. Then he traced the problem, created a fix and issued a new version within 2 hours. He apologised for the slow response: he’d been Christmas shopping! 10 out of 10 for effort on the part of “AlexC”, and a most impressive bug fix time for a small project. The search results are a work in progress, but I think as the Majestic-12 project grows it could become a serious contender to the big boys of the search engine game.