
Building a Spider / Robot

Disclaimer

None of the code below comes with any guarantee or warranty as to its suitability for any purpose; use it at your own risk. I have not tried to optimise any of the code found here: as soon as it worked I stopped developing it. This document is also under construction as of 04 Nov 2003, so please bear with me.

My Take On Performance

I do not have a bee in my bonnet about optimising code, but I do have one about premature optimisation. I will only optimise a piece of code if I have a severe problem with it. I do not believe in optimising and profiling every bit of code as I write it. I prefer to get something working and then look at the whole system rather than doing it piecemeal. This technique has served me well; try it.

Spiders, Robots and Crawling

The first thing we need if we are going to build a search engine is a lot of searchable data. I could have generated my own using perl and the standard unix dictionary, but this would have been unrealistic and pointless. Besides, I have just written two robots to harvest links from the internet, so I thought I might as well use them to get some random pages off the internet.

Robot and Spider Netiquette

Before starting to run a web crawler (I use robot, spider, web crawler, etc. interchangeably in this document) it is a good idea to find out a bit about robots, in particular the robots.txt file. This file should be found in the root directory of every website and tells robots what they are allowed to download from the site. Please have a look at this example robots.txt file, at CNN, to see what one should look like. It is considered very bad manners to spider in places where the robots.txt file has specifically asked you not to look. There are also selfish reasons for paying attention to it: consider a directory full of ISO images that the site has not excluded in its robots.txt file. You are going to spend an awfully long time downloading a lot of useless data. So please obey the robots.txt file.
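As an aside, here is a minimal sketch of checking a URL against a site's robots.txt by hand with the WWW::RobotRules module. The agent name and URLs are placeholders, and the LWP::RobotUA module used by the robots further down does this checking for you automatically.

            #!/usr/bin/perl -w
            # Sketch only: check a single URL against a site's robots.txt.
            # The agent name and URLs are placeholders.
            use strict;
            use warnings;
            use LWP::Simple qw(get);
            use WWW::RobotRules;

            my $agent = 'linksnuffler/0.1';
            my $url   = 'http://www.example.com/some/page.html';

            my $rules = WWW::RobotRules->new($agent);

            # Fetch robots.txt from the root of the site and feed it to the parser.
            my $robots_url = 'http://www.example.com/robots.txt';
            my $robots_txt = get($robots_url);
            $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

            # Only fetch the page if the rules allow it.
            if ($rules->allowed($url)) {
                print "Allowed to fetch $url\n";
            } else {
                print "robots.txt forbids $url, skipping\n";
            }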

Polite crawling

It is also considered bad manners to over-spider a site; that is, you should not hit a site too often. Consider a large site like the BBC: when your spider comes across it, it is going to find an awful lot of links pointing back to the BBC, so your spider could hit the BBC thousands of times in one evening. This is not good for them or for you, because you are eating their bandwidth and you will probably be banned from accessing the site. Take it easy on websites; I will detail some ways of doing this in my code so that we do not offend anyone.
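To give a flavour of the sort of throttling I mean, here is a minimal sketch that remembers when each host was last fetched and skips anything hit within the last few minutes. The %last_hit hash, the five minute gap and the polite_enough name are all inventions for this example; the robots below use cruder methods.

            # Sketch of a per-host hit limit: refuse to fetch a URL if its
            # host has been hit within the last five minutes. The gap is an
            # arbitrary choice.
            my %last_hit;
            my $min_gap = 5 * 60;    # seconds between hits on one host

            sub polite_enough
            {
                my ($url) = @_;
                my ($host) = $url =~ m{^http://(.*?)(?:[/#?]|$)};
                return 0 unless defined $host;
                my $now = time();
                if (exists $last_hit{$host} and $now - $last_hit{$host} < $min_gap)
                {
                    return 0;    # too soon, skip this URL for now
                }
                $last_hit{$host} = $now;
                return 1;
            }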

Building a Robot

I chose Perl to build my robot for numerous reasons, not least of which are its text-handling facilities and the LWP modules. There are modules already written that adhere to the robots.txt file, which means I do not have to build my own; to be honest, half the robot has been written for me and I just plugged it all together. To build a robot that can continually trawl the net it needs to be recursive, by which I mean it must be able to find the links on each web page and use those links later. This means we need some way to store the links. I will be using the following postgres database to store all the links that I find. Most people would prefer to leave the links in the file in which they were found and just parse the file for them later. This is a good way to do it, and I will be writing my own tools to do this, but my previous robot was only a link harvester so I had no need to store the pages I was spidering. I simply parsed each page in memory, saved the links to a database, then discarded it. Now that I am building a search engine I need these pages, so our new spider will save them to disk for later processing.
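For anyone wondering what that postgres database amounts to, here is an illustrative sketch of a minimal link table created through DBI. The database name, user, table name and column names are placeholders; the real tables are on the Database Schema page.

            # Illustrative only: create a simple link table through DBI.
            # Database name, user and column names are placeholders.
            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect('dbi:Pg:dbname=spider', 'harry', '',
                                   { RaiseError => 1, AutoCommit => 0 });
            $dbh->do(q{
                CREATE TABLE url (
                    url_id     serial PRIMARY KEY,
                    url        varchar(2000) UNIQUE NOT NULL,
                    state      integer DEFAULT 0,   -- 0 means not spidered yet
                    last_seen  date
                )
            });
            $dbh->commit();
            $dbh->disconnect();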

A day in the life of my old robots

When my old spider started, the first thing it did was make a call to the database for the next 10000 links to be spidered (when I started it for the first time the database had been seeded with a few links). It put these links into an associative array (hash), keyed by URL, along with the url_id. The following snippet shows what I did.

          while ( ( $base_url, $url_id ) = $sth->fetchrow_array() )
          {
              $item = $base_url;
              $item =~ s/(http:\/\/.*?)\/.*/$1/;
              $domain_counter{$item}++;
              next if ($domain_counter{$item} > 3);
              $url_list{$base_url} = $url_id;
              $counter++;
          }
      

The hash "%url_list" is the set of URL's that are going to be spidered. This single little bit of code above goes a long way to being a bit more polite when the spider is crawling. It effectively acts as a hit limit. One thing to bear in mind though is how many robots that you start and if you need to restart the robots for any reason. If you do it starts again and it will hit another 3 in the 10000 it selects from. This is OK if you only start a couple of robots but if you start 30 then that is a possible 90 hits every time you need to restart them. It is still a long way off being polite but it is a little closer. I will gradually develop better ways to impliment a hit limit as we go through this document.

The postgres database I use to store the links is described on the Database Schema page.

The Robot Code

I am not saying that this is the best way to build a robot, nor am I saying that it is the worst. However, I can say that using this robot I have managed to download 1.4 million pages in 6 weeks, and that is not running it 24/7. From those 1.4 million pages I have extracted over 80 million links. There is also an accompanying robot that should really be run before this one to check that each URL is valid, but I will leave that as a task for the reader since half the code is here already. I am also aware that this spider is not really recursive; I did this for my own reasons, but I will show you how to write a recursive spider later.

This spider was a quick hack, so please do not email me telling me I could have done it a better way. I know I could; that's the purpose of this document ;-) It is not a very efficient way to do it either, and I will explain a much more efficient approach later.

	    
            #!/usr/bin/perl -w
            use lib qw( /home/harry/bin);
            use strict;
            use warnings;
            use LWP::Simple;
            use HTML::LinkExtor;
            use LWP::RobotUA;
            use HTTP::Request;
            use HTTP::Response;
            use HTTP::LinkSnuffler;
            use DBI;
            use URI::URL;
            
            $SIG{HUP} = \&handle_sig;
            my $finish_quickly;
            
            my $start_range = $ARGV[0];
            
            my ( $counter, $dbh, $sql, $sth, $url_hash_ref) = HTTP::LinkSnuffler::setup($start_range);
            my ( $error_code, $request , $response , $parser, $date, $base_url);
            
            print "First bit of SQL Finished for: $start_range\n";
            
            my $ua = new LWP::RobotUA('linksnuffler/0.1', 'harry@uklug.co.uk');
            $ua->delay(0.05);  # delay between requests to one server, in minutes; be nice and go slowly
            
            foreach $base_url ( keys %$url_hash_ref )
            {
                next if($finish_quickly);
                --$counter;
                print "\nCounter ---> $counter\n";
                print "\nWe are now working on: $base_url\n";
            
                ( $parser, $response )  = HTTP::LinkSnuffler::get_parser($ua , $base_url);
            
                $sql = qq{ select update_state( ?, ?, ? ) as output}; 
                $sth = $dbh->prepare( $sql );
                if($response->is_success)
                {
                    eval
            	    {
                        system("lynx -dump $base_url | gzip > /links/main/htmldocs/$url_hash_ref-<{$base_url}.gz"); 
                    };if($@)
                    {
            	        print "System Error: $@\n";
                    };
            
                    $error_code = 1;
                    # If we find an error then we allocate state 2 so that we can
            	    # avoid it in the future.
                    eval 
                    {
                        $sth->bind_param( 1, "$base_url"  , $DBI::SQL_VARCHAR );
                        $sth->bind_param( 2,  $error_code , $DBI::SQL_INTEGER );
                        $sth->bind_param( 3,  $date       , $DBI::SQL_INTEGER );
                        $sth->execute();
            	    $dbh->commit();
                    }; if ($@) 
            	    {
            	        print "Error updating state $@\n";
                        exit 1;
            	    };
                    print "Success with: $base_url\n";
                }else
                {
                    $error_code = HTTP::LinkSnuffler->get_error_code( $response->status_line, $counter);
                    eval 
                    {
                        $sth->bind_param( 1, "$base_url"  , $DBI::SQL_VARCHAR );
                        $sth->bind_param( 2,  $error_code , $DBI::SQL_INTEGER );
                        $sth->bind_param( 3,  $date       , $DBI::SQL_INTEGER );
                        $sth->execute();
            	    $dbh->commit();
                    }; if ($@)
            	    {
            	        print "Error updating state $@\n";
            	        exit 1;
            	    };
                   
                    print "Not successful trying: $base_url\n";
                    next;
                }
            
                $parser->parse( $response->content );
                my @links = $parser->links;
                
                # We have incorporated a lot of the logic into the
                # function in the database. We need to make sure that
                # we are storing the data correctly or we are fscked.
                $sql = qq{ select insert_child_link( ? ,? ) as output }; 
                $sth = $dbh->prepare( $sql );
                foreach my $link (@links) 
                {
                   next unless ($$link[0] eq "a");
                   next unless ($$link[2] =~ /^http:/);
                   if ($$link[1] eq "href")
                   {
            	       my $length_url = length($$link[2]);
                       
                       if ( $length_url >  2000 )
                       {
            	       print " $length_url\n";
                           next;
                       }	   
                        
                       print "$url_hash_ref->{$base_url} ----< $base_url ----< $$link[2]\n";
                       
                       eval 
                       {
                           $sth->bind_param( 1, "$base_url"  , $DBI::SQL_VARCHAR );
                           $sth->bind_param( 2, "$$link[2]"  , $DBI::SQL_VARCHAR );
                           $sth->execute();
                           $dbh->commit();
                       }; if ($@) 
            	       {
            	           print "Error $@ \n";
            	           exit 1;
            	       };
                   }
                }
                
            }
            $sth->finish();
            $dbh->disconnect();
            sub handle_sig()
            {
               $finish_quickly = 1;
               print "I am terminating gracefully\n";
            }
      

Problems with the above robot

Now that we have a robot running, how do we get the text from the websites? You will notice that the robot above does not save any of the text downloaded from the websites. This is because I started it as a link extractor and was not going to do anything else with it except make it more polite and tidy the code up a bit. The design needs to be changed to download pages to disk and then parse the files for links. This is not only a better use of my resources but, more importantly, it is much more polite to the web sites that are being crawled.
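To give a flavour of what that offline processing looks like, here is a rough sketch that reads one of the gzipped pages back off disk and pulls the links out with HTML::LinkExtor. The file path is a placeholder and none of the robots here use this exact routine.

            # Sketch: parse a previously saved, gzipped page for links.
            # The path is a placeholder for wherever the spider stored it.
            use strict;
            use warnings;
            use HTML::LinkExtor;

            my $file = '/links/htmlpages/0/12345.gz';
            open(GZ_IN, "gzip -dc $file |") or die "Cannot read $file: $!\n";
            my $html = do { local $/; <GZ_IN> };
            close(GZ_IN);

            my @links;
            my $parser = HTML::LinkExtor->new(
                sub {
                    my ($tag, %attr) = @_;
                    push @links, $attr{href}
                        if $tag eq 'a' and defined $attr{href};
                });
            $parser->parse($html);
            $parser->eof;

            print "$_\n" for grep { /^http:/ } @links;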

Fixing my design

We have seen that the robot above is not really suitable for use as a search engine harvester, and to be honest it is not even a very good robot, so we need to modify our method a little to avoid downloading pages more than once. I already have a database with 11 million unique links in it, so I can use this to my advantage. If you want to use the above robot then I would recommend only gathering a couple of thousand links with it and then stopping altogether. We are going to build a better one, or at least try.

How can we improve the design

The most important consideration is consideration for others. Even if the robot turns out to be pretty crap but polite to other websites, then we have done a good job and everyone is happy. I have had 2 marauding robots, much to my shame, and it is not a pretty sight. We also want to download a page only once and then be able to manipulate it at will. With these thoughts in mind we can now get on with a conceptual overview of what we need to do.

  1. Grab unique links from a database or seed it with one to start from.
  2. Select a few links from the common domains to ensure we are being netnice.
  3. Check the robots.txt file on the server to ensure we are allowed to download the URL.
  4. Check to ensure that the link is good.
  5. If it is good then download the file to a local directory.
  6. If it is bad update the database to reflect this.

After the page has been deposited on the disk, any further processing should not be the job of the spider. We can see straight away that the robot we saw earlier fulfils several of these requirements, so we can butcher it to meet our needs.

The bare minimum Robot

The following piece of code is what I would consider the bare minimum robot for our purposes. Remember that I am only after a suitable set of documents to test the search engine; I am not trying to spider the whole internet. You will also see that this robot thinks it is polite. It is not: it still has some severe flaws.

            #!/usr/bin/perl -w
            use lib qw( /home/harry/bin);
            use strict;
            use warnings;
            use LWP::Simple;
            use HTML::LinkExtor;
            use LWP::RobotUA;
            use HTTP::Request;
            use HTTP::Response;
            use HTTP::LinkSnuffler;
            use DBI;
            use URI::URL;
            my ( $request , $response , $parser, $base_url);
            my $robot_number = $ARGV[0];
               $base_url = $ARGV[1];
            my %big_hash_of_links;
            my $robot_directory = "/links/htmlpages";
            my $recursive_level = 0; 
            my $filename;
            my $ua = new LWP::RobotUA('linksnuffler/0.1', 'harry@uklug.co.uk');
               $ua->delay(0.0001);  # delay is in minutes, so this is effectively no delay at all
            
	    open (ALL_DOMAINS, ">>$robot_directory/$robot_number/all_links.txt");
            
            ( $parser, $response )  = HTTP::LinkSnuffler::get_parser($ua , $base_url, 'HEAD');
            if($response->is_success)
            {
                open(FILE , "<$robot_directory/$robot_number/file_count");
                open(DONE_URLS , "<$robot_directory/$robot_number/urls_done");
                foreach $base_url (<DONE_URLS>)
                {
                   chomp($base_url);
                   $big_hash_of_links{ $base_url }++;
                }
                close(DONE_URLS);
                foreach my $i (<FILE>)
                {
                    $filename = $i;
                    print "File names starting at number: $filename\n";
            	last;
                }
                close(FILE);
                open(DONE_URLS , ">>$robot_directory/$robot_number/urls_done");
                main::loopy_spider(\%big_hash_of_links, $base_url);           
            }else
            {
                print "Unsuccessful URL: $ARGV[1]\n";
            }
            #####################################
            #
            #  SUBROUTINES
            #
            #####################################
            
            sub loopy_spider($$)
            { 
                my ( $big_hash_of_links, $domain ) = @_;
                $base_url = $domain;
                $domain =~ s/(http:\/\/.*?)[\/|#|\?].*/$1/;
                $big_hash_of_links->{$domain}++;
                print DONE_URLS "$domain\n";
                
                print "We are now on iteration: $recursive_level\n";
                print "\nNow Working on link: $base_url\n\n";
                # This variable is used to store each filename as it is used,
                # otherwise we would be overwriting files each time we start
                # the robot. You will see later that the value gets echoed into
                # a file to make it persistent across restarts.
                $filename++;
            
                # This variable tells me how deep the recursion has run. It is
                # more for my own personal interest than anything else, but we
                # might see a very important use for it later depending on the
                # method we choose to build our robot.
                $recursive_level++;
                
                # Before requesting the whole page I try to ensure that it
                # exists and that it is of the correct filetype. This means I
                # make a header request first and do some checking. It does
                # mean that I will hit the server twice, but this is acceptable
                # when we consider the wasted downloads this little test can
                # save us and the webserver.
                my ( $parser, $response )  = HTTP::LinkSnuffler::get_parser($ua , $base_url, 'HEAD');
                if($response->is_success)
                {
                    if( $response->content_type !~ /html/i)
                    {
                        my $content = 'NOT HTML';
                        main::gzip_to_file(\$content , $filename);
                        return;
                    }
                    
            	( $parser, $response ) = HTTP::LinkSnuffler::get_parser($ua , $base_url, 'GET');
            	my $content = $response->content();
		# We are storing the file to disk so we call this routine to do
		# just that. It uses a pipe to gzip to get some space saving.
            	main::gzip_to_file(\$content, $filename);
            	$parser->parse( $response->content() );
            	   
            	my @links = $parser->links;
            	my %unique_page_links;
            	my $link;
            	foreach $link ( @links ) {
                        next unless ( ( $$link[0] eq "a" ) 
			    and ( $$link[2] =~ /^http:/) 
			    and ( $$link[1] eq "href" ) );			 
            	    $domain = $$link[2];
		    # This regexp is not ideal but is fairly good at
		    # stripping the domains from the links that we found.
		    # We make sure that we have not seen the link before by
		    # checking our big_hash_of_links. If we have seen that
		    # domain before then we skip to the next link found.
            	    $domain =~ s/(http:\/\/.*?)[\/|#|\?].*/$1/;
            	    if($big_hash_of_links->{$domain}) {
            	       next;
            	    }
		    # If we get this far the domain the link belongs to has not
		    # been seen before so we can add it to a hash of links that
		    # we need to search.
                    # We can easily select to do home pages only by swapping these two around.
                    # It is currently set to get home pages
            	    # $unique_page_links{ $$link[2] }++ ;
            	    $unique_page_links{ $domain }++ ;
            	    
                # I need to store parent links and the links found for
                # another project, hence the reason I am writing them out to
                # a file for the time being. This may change later as we
                # improve this robot.
            	    if( $unique_page_links{$domain} < 2 ) {
            	        $unique_page_links{$domain}++;
            		print ALL_DOMAINS "File: $filename Parent: $base_url ---> Child: $domain\n";
            	        print "Parent: $base_url ---> Child: $domain  NEW\n";
            	    } 
            	}
                # We need to undef the array and any other variables that are
                # not needed any more. I will explain the reason for this later,
                # although it is fairly easy to guess.
            	undef @links;
            	undef $domain;
            	foreach $link (keys %unique_page_links)	{
                   # First echo the filename out, then call our subroutine
                   # again. This is where it gets a bit odd: we have just
                   # called the subroutine from within itself.
            	   system(`echo $filename > $robot_directory/$robot_number/file_count`);
            	   main::loopy_spider($big_hash_of_links, $link);
                    }
                }
		else {
                    my $content = $response->status_line;
                    main::gzip_to_file(\$content , $filename);
                }
                # If we have got to here it means that one page and all the
                # links it contains have been successfully parsed, or at least
                # all the domains in that page have. This means that we are
                # stepping out of the stack one level, so we deduct one from
                # our iteration counter.
                $recursive_level--;
                return 0;
            }
            
            # Pretty self explanatory.
            sub gzip_to_file($$) {
                my ($content , $filename) = @_;
                open(GZIP_TO_FILE , "| gzip -c > $robot_directory/$robot_number/$filename.gz") 
		  or die "Unable to open pipe to GZIP: $!\n"; 
                print GZIP_TO_FILE "$$content";            
                close(GZIP_TO_FILE);
                return 0;   
            }
      
      

This robot is more efficient than the last one but still suffers from some serious shortfalls. The first thing to notice is that it does not rely on a database: you pass this robot a URL as its second argument and off it goes. It then uses that link to call the subroutine "loopy_spider". This subroutine downloads the page, if it exists, and parses it for links. If it is successful it writes the file to disk and then calls itself with each of the unique links it found. This is a truly recursive robot.

Problems with our recursive Robot

There are more than 3 major flaws, but I like to concentrate on the big fish rather than worry too much about the other problems, which may well get fixed during the next iteration of the design process.

Problem Number 1

The biggest flaw in this robot is not the way it has been coded, although I imagine some of you have had a laugh at it. The flaw is that it is one really ignorant robot. I have put in some bits to ensure that we only hit a domain once, but this is the flaw: it is hitting domains! If you know anything about the Apache webserver then you will also know that a single server can declare virtual hosts, and there is no real limit to how many it can have. This means that one IP address can serve thousands of domains. This is a very serious problem and it needs to be addressed immediately. With enough bandwidth I could unwittingly bring a webserver to its knees if it had lots of different domains attached to its IP address. This is not good, not good at all.

Proposed solution

It is pretty apparent that to be really polite we need to guarantee we only hit a single IP address once over a specified time period. Fortunately for us, Perl comes to the rescue again: we can use the builtin function "gethostbyname". Perl normally has a very easy solution to what appear to be pretty terrible problems. Using "gethostbyname" does introduce another interesting problem, but I will detail it later; it is fairly easy to fix.
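As a quick illustration, the snippet below resolves a host name to a dotted-quad IP address with gethostbyname; keying the hit limit on that IP rather than on the domain is the idea behind the improved robot. The host name is only an example.

            # Resolve a host name to a dotted-quad IP address. Limiting hits
            # per IP rather than per domain stops us hammering one box that
            # serves many virtual hosts.
            use strict;
            use warnings;
            use Socket qw(inet_ntoa);

            my $host   = 'www.example.com';    # example host only
            my $packed = gethostbyname($host);
            if (defined $packed)
            {
                my $ip = inet_ntoa($packed);
                print "$host resolves to $ip\n";
            }
            else
            {
                print "Could not resolve $host\n";
            }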

Problem Number 2

Although recursion is a great way to solve certain types of problems, is it really suitable for what we are trying to do? I let the robot run for 24 hours as an experiment; it started with a footprint of under 10Mb but grew to over 300Mb. This is not really very good considering I have only got 1Gb of RAM on my machine. I did notice that the memory growth slowed down rather than sped up, but this is because the robot would skip more and more domains. If I had unlimited RAM I could let the recursive robot carry on indefinitely, but unfortunately I am running this on a limited resource. I carried out some simple tests using a recursive function that had local variables declared in scope, and after a few hundred thousand calls nearly all the memory on the machine was gone.

Proposed solution

Perl has no limit on recursion. This means that if we can reduce the memory overhead sufficiently our program could run for a very long time before it starts to take up too much memory. However, having done some tests, this method is still flawed, so we need to cap our robots somehow. We also need some way of reducing the overhead of our programs sufficiently, because a robot that only lasts 24 hours is not much use to anyone. We will also see that the solution is tied in with other problems, and that solving a problem in one area sometimes provides a solution in another.
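One crude stop-gap, which is not in the code above, is simply to refuse to recurse past a fixed depth. The self-contained sketch below demonstrates the idea; the limit of 500 is an arbitrary figure, and in the real robot the same guard would sit at the top of loopy_spider using $recursive_level.

            # Self-contained demonstration of capping recursion depth.
            use strict;
            use warnings;

            my $max_recursion   = 500;    # arbitrary cap
            my $recursive_level = 0;

            sub capped_recurse
            {
                if ($recursive_level >= $max_recursion)
                {
                    print "Recursion limit reached at depth $recursive_level\n";
                    return 0;
                }
                $recursive_level++;
                capped_recurse();
                $recursive_level--;
                return 0;
            }

            capped_recurse();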

Problem Number 3

The size of the "big_hash_of_links" hash will get much to large to continue to be an efficient method of checking to see if a domain has been hit before. We are going to use IP addresses in the future which will reduce total length of the string being used as the key but, this is still not going to have much of an affect when we have seen several million pages.

Proposed solution

We do not store all the keys in the hash itself but in another structure. This can be done in one of several ways, which I will go over later.

Improving our model again

The biggest problem is actually the easiest to solve, and for now I can leave it until the next set of source code is displayed. The problem that I will attempt to solve first is number 3. I would have picked number 2, but I know that the two solutions are quite heavily dependent on each other, so I will detail problem three's solution first.

Million key hash

A hash with a million keys in it is getting a bit big and unwieldy for our purposes, so we need to do something about it. The thing is that the interface to the hash is very intuitive once you are used to it, so we would really like to keep it, and we would also like to retain at least some of the performance associated with a hash. Having lots of hard disk space, we should really try to use that for our storage rather than raw RAM. Unfortunately, we are going to take a performance hit, but I am not really concerned about this; we can worry about performance when we have a real problem with it.
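One of the ways I have in mind is to tie the hash to a DBM file so that the data lives on disk but the hash interface stays exactly the same. The sketch below uses DB_File; the filename is a placeholder and this is only one of several modules that could do the job.

            # Sketch: tie %big_hash_of_links to an on-disk DB file so it
            # survives restarts and stays out of RAM.
            use strict;
            use warnings;
            use DB_File;

            my %big_hash_of_links;
            tie %big_hash_of_links, 'DB_File', '/links/seen_domains.db'
                or die "Cannot tie hash to DB_File: $!\n";

            # The hash is then used exactly as before.
            $big_hash_of_links{'http://www.example.com'}++;
            print "seen before\n" if $big_hash_of_links{'http://www.example.com'} > 1;

            untie %big_hash_of_links;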

Database

Please have a look at the Database Schema; the tables that I am going to be using for this project can be found there. I have also written a couple of simple Postgres functions to insert links into the database, and they are pretty easy to follow.

What to store

We know from problem 1 that we need to use IP addresses to avoid producing an ignorant robot. We also want to store some date information so that we can tell when each IP address was last visited. This will enable us to go back to the site after a set time and do some more spidering, which is acceptable as long as the time chosen is not too short. Postgres has various builtin functions for handling dates, so we have no problem there.

We need to save each downloaded page's URL in one table along with any children found (this is for my own reasons; you may want to do it differently), and we also need to save the IP address in another table with a timestamp. I also intend to start assigning a url_id to each url found and then naming the files we download after their associated "url_id"; this is for administration purposes.
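The sort of check against the IP table that I have in mind looks roughly like the sketch below, written as a small subroutine so it can be dropped into the robot. The ip_visited table, its columns, the one day interval and the visited_recently name are all illustrative, not the real schema.

            # Sketch: has this IP address been spidered in the last day?
            # Table and column names are placeholders for the real schema.
            sub visited_recently
            {
                my ($dbh, $ip) = @_;
                my $sql = qq{ SELECT count(*) FROM ip_visited
                              WHERE ip = ? AND last_visit > now() - interval '1 day' };
                my $sth = $dbh->prepare($sql);
                $sth->bind_param(1, $ip, $DBI::SQL_VARCHAR);
                $sth->execute();
                my ($recent) = $sth->fetchrow_array();
                return $recent ? 1 : 0;
            }

            # Used inside the robot, something like:
            # next if visited_recently($dbh, $ip);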

The new Improved Robot

This robot still has some flaws, but it is a far cry from the first one we built. It uses IP addresses to ensure that we do not hit a single server too often. It is connected to a postgres database, so we now have a persistent record of where we have been and when. We use gzip to compress the files and store them to disk for later processing. The files are stored without stripping the HTML from them; I will be writing a utility later that uses lynx to strip the HTML and store them as plain text, or at least what lynx thinks is plain text. We will also use some of lynx's builtin features to strip out some of the unwanted text that we do not want to see in the page.
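That stripping utility will probably end up looking something like the sketch below: gunzip the stored page, hand it to lynx -dump to get plain text, and gzip the result. The paths are made up, and the -nolist and -force_html options are my current guesses at the lynx features I will use rather than the finished tool.

            # Sketch: turn a stored gzipped page into gzipped plain text with
            # lynx. -nolist drops lynx's trailing list of references and
            # -force_html makes it treat the temporary file as HTML.
            my $in  = '/links/htmlpages/0/12345.gz';
            my $out = '/links/textpages/0/12345.txt.gz';
            my $tmp = '/tmp/spider_page.html';

            system("gzip -dc $in > $tmp") == 0
                or die "gunzip failed: $?\n";
            system("lynx -dump -nolist -force_html $tmp | gzip -c > $out") == 0
                or die "lynx or gzip failed: $?\n";
            unlink $tmp;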

       

Finished

For my purposes this robot is finished, and the work that I have been doing on it will slow down considerably. It is fast enough and polite enough for my needs. It could be improved on considerably, and I would welcome any suggestions or additions that people would like to see. At some point I will move the whole thing into a module so that I can create a spider class from it and make it as abstract as possible. You will notice that the code for the actual spider has already been broken up into various smaller routines. I will post the module as soon as I have finished it and give some details on how to operate it. If anyone wants it to work on another database or platform like MySQL, let me know; I could be tempted, but I am not promising anything.

Possible Improvements

For a good list of things that a robot should be capable of doing, the following website gives a fairly good description. You will quickly see just how complicated a robot can get. The above robot does not even implement half of the required functionality. Maybe one day I will get round to implementing these features.