
Building A Vector Space Search Engine


None of the code below comes with any guarantee or warranty as to its suitability for any purpose; use it at your own risk. This document is also under construction as of 12 Jan 2004, so please bear with me.

Indexing the Internet

An index is a type of pointer, but a pointer to what? With search engines the input cannot be discerned beforehand, which implies that an effective index should be dynamic. Let's take the example of a book. When we search the index of a book we know what we want to find before we start, but we need a quick way to locate the page. (If we think about it, it is not really the page number we are after but the information on the page, but that is another topic.) I have regularly searched IT books for various topics and needed to change my search to something I consider more obscure to find what I am looking for, and on occasion I am not successful. This is because the index in a book is not dynamic. We are limited to a finite set of input terms; these search terms have already been compiled by the publisher based on what they and the author thought, at the time, were the most likely search topics. This model is used with some success on the internet. Dmoz uses it, and we regularly see large directories where we can search through each topic until we reach (or do not reach) the topic we are after.

The problem is that we are not guaranteed to find what we want. So what can we do? Is there a better, more effective way? We know we cannot produce a list of pointers for every possible input we may encounter, so compiling a universal directory is out of the question. This means that to some extent our engine needs to carry out a search of some magnitude for every input, and this search will generate a pointer or list of pointers to resources. Now that we know we need to carry out a search every time we receive an input, how can we do this and be efficient at the same time? (I am not going to address caching at this point.)

Massive Indexes

If we take Google as an example. (When you read this in a few years you will wonder what all the fuss was about.)

Google indexes in excess of 3 billion documents. This seems like a staggering figure, and don't get me wrong, it is.

Let's play around with some numbers. The following numbers are just guesses, but they will help you visualise the magnitude of the problem; note that Google does not necessarily do things the way I am describing. In fact there are few people out there who really know how Google actually does its stuff. Contrary to what a lot of people will tell you, most have not really got a clue how Google works. It is a secretive operation, and for good reason.

Let's assume the following:

  • An average web page is 20 Kbytes (20,000 bytes)
  • Google indexes 3,000,000,000 of these
  • Google has 100,000,000 hits per day

    This means that Google must handle approximately 1,200 hits a second. For every hit it must search through 60,000,000,000,000 bytes of data, which is 60 terabytes. This is an enormous amount, so how can it be searched effectively 1,200 times a second?
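    A quick back-of-envelope check of the arithmetic above, sketched in Python; the figures are the guesses from the list, not real Google numbers:

    ```python
    # All figures are the rough guesses from the bullet list above.
    SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

    hits_per_day = 100_000_000
    pages = 3_000_000_000
    avg_page_bytes = 20_000               # "20 Kbytes" per page

    hits_per_second = hits_per_day / SECONDS_PER_DAY
    total_bytes = pages * avg_page_bytes

    print(round(hits_per_second))         # about 1,157, i.e. roughly 1,200
    print(total_bytes)                    # 60,000,000,000,000 bytes = 60 TB
    ```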

    Search [ engines | engineers ] are very clever

    If we think about a web page for a second, we know straight away that we do not need a lot of the text in the document. (I know Google uses the tags in a document for term weighting, but I am ignoring that for now.) We are able to strip out all the HTML and still be left with a document with its semantic meaning intact. In fact, due to abuse of some meta tags, ignoring the markup is sometimes better. Let's assume the following:

  • Removing HTML reduces doc size by 20%
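    As a minimal sketch of tag stripping, here is one way to keep only the character data of a page using Python's standard library HTML parser (the sample page string is made up for illustration):

    ```python
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect only the character data, discarding tags and attributes."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

        def text(self):
            return " ".join(self.chunks)

    page = "<html><head><title>Hi</title></head><body><p>Plain <b>text</b> survives.</p></body></html>"
    extractor = TextExtractor()
    extractor.feed(page)
    print(extractor.text())
    ```

    The semantic content survives while every tag disappears, which is exactly the reduction the bullet point above is counting on.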

    Stop List

    Now that we have removed the HTML we are down to 48,000,000,000,000 bytes (48 terabytes) of data. This is still an astronomical amount; it is more than twice the size of the entire contents of the Library of Congress. If you look through the text you have been reading, you will notice various words that appear much more frequently than others. These words add structure to the page and make it easier for us to read, but do they add much extra information to the document? In the majority of cases they don't.

    For Instance: Now have removed HTML down 48,000,000,000,000 Terabytes data. still astronomical amount. more 2 times bigger entire contents library congress. look through text have been reading will notice various words appear much more frequently others. These words add structure page make easier read add any extra information document. majority cases don't.

    The text above is the previous paragraph with a lot of terms removed. I have not done this scientifically, and building a list of stop words is a very personal thing: if I were building a list for a law firm I might remove very different words than I would for an electronics company. So, to cater for most of the internet, we build a stop list of words that we are pretty sure can be ignored.

  • Removing irrelevant words reduces doc size by 20%
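    Filtering against a stop list is a one-liner once the list exists. A tiny illustrative sketch in Python (a real list would be far larger and tuned to the domain, as noted above):

    ```python
    # A tiny illustrative stop list; a real one would be much larger.
    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "we", "is",
                  "that", "in", "it", "for", "this", "i", "not"}

    def remove_stop_words(text):
        """Keep only the terms that are not in the stop list."""
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    sentence = "These words add structure to the page and make it easier to read"
    print(remove_stop_words(sentence))
    ```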

    We are now down to 38,400,000,000,000 bytes (38.4 terabytes).


    Another method frequently employed is "Stemming". This is the process of removing the commoner morphological and inflexional endings from words. The author's site is a much better place to find out what it does than here. I took the two outputs provided on the site, measured their sizes, and found a difference of approximately 14%. So we are now down to 33,024,000,000,000 bytes. We have nearly halved our total searchable text, but this is still too much and can be reduced much further.

  • Stemming reduces the size by 14%
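    To give a feel for the idea, here is a deliberately crude suffix-stripping sketch in Python. This is not the real Porter algorithm, which is a careful multi-step rule set; it only shows why stemmed text is shorter:

    ```python
    # Crude suffix stripping; NOT the Porter stemmer, just an illustration.
    SUFFIXES = ("ing", "ies", "ed", "es", "s", "ly")

    def crude_stem(word):
        """Strip the first matching suffix, keeping a stem of 3+ letters."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    for w in ("stemming", "stemmed", "stories", "words"):
        print(w, "->", crude_stem(w))
    ```

    "stemming" and "stemmed" both collapse toward the same shorter stem, which is where the measured ~14% size saving comes from.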

    Word Translation

    The way we store words in a computer is great for us but inefficient for mass manipulation in binary. The most common method for storing letters in a PC is the American Standard Code for Information Interchange (ASCII). What this means is that for every character there is an associated integer value which consists of 8 binary bits (one byte). Knowing this, I decided to see what the average English word length is, based on words I downloaded from the stemming website. I did this using the following script.

                #!/usr/bin/perl -w
                use strict;
                use warnings;

                my $word_count          = 0;
                my $total_word_length   = 0;
                my $average_word_length = 0;

                open( FILE, "</links/main/secondstage/stemmedvoc.txt" )
                    or die "Cannot open file: $!\n";
                #open( FILE, "</links/main/secondstage/unstemmedvoc.txt" )
                #    or die "Cannot open file: $!\n";

                # Sum the length of every word and count them as we go.
                while ( my $word = <FILE> ) {
                    chomp $word;
                    $total_word_length += length($word);
                    $word_count++;
                }
                close(FILE);

                $average_word_length = $total_word_length / $word_count;
                print "$average_word_length\n";
                exit 0;

    The result was a rough average of 7 for the unstemmed version and 6 for the stemmed version. Just out of interest I tried it on the dictionary I found on this machine and got an average of 8. I know this is not every word in the English dictionary; I am only trying to demonstrate the various methods of reducing the amount of data that needs to be searched. I have also used the unstemmed version of the file for the figures below. If we have an average word length of 7, then the average word takes up 7 bytes of storage. This is an obscene amount of space, but unfortunately we can only read words. What we can do now is convert each word to an integer. An integer in C++ takes up 4 bytes of storage and can represent a few billion values. With that in mind we have now reduced our storage requirement for each word from 7 bytes to 4, which takes us to a total of 21,888,000,000,000 bytes.

  • We have reduced the unstemmed amount by 43%
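    The word-to-integer translation is just a lookup table that hands out the next free id. A minimal Python sketch (the 4-byte packing mirrors the C++ integer mentioned above):

    ```python
    import struct

    # Assign each distinct term a small integer id, then store the id as a
    # fixed 4-byte unsigned integer instead of the variable-length ASCII word.
    term_ids = {}

    def term_id(term):
        """Return the integer id for a term, allocating a new one if needed."""
        if term not in term_ids:
            term_ids[term] = len(term_ids)
        return term_ids[term]

    word = "stemming"
    packed = struct.pack(">I", term_id(word))   # big-endian 4-byte unsigned int
    print(len(word), "bytes as ASCII,", len(packed), "bytes as an integer")
    ```

    An 8-letter word shrinks from 8 bytes to a fixed 4, and every word, however long, costs the same 4 bytes.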

    Reducing it even further

    There are various ways to reduce the amount even further. The best way to reduce the overall size is not to rely on the built-in data types supplied by your chosen language. In C++ the integer uses 4 bytes, but we do not really need all of these. We could use 3 bytes (24 bits) instead, with 21 bits representing the value, and still have some bits left over for other purposes which I will detail later. Another method that quickly reduces the size of the overall data set is the stop list. Google has a very relaxed stop list compared to most. For our search engine I am going to be quite ruthless and build a fairly comprehensive list.
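    A sketch of that 3-byte layout in Python: 21 bits hold the id (up to about 2 million distinct terms) and the top 3 bits stay free for flags. The flag value here is made up for illustration:

    ```python
    # Pack a 21-bit term id plus 3 spare flag bits into exactly 3 bytes.
    def pack24(term_id, flags=0):
        assert term_id < (1 << 21) and flags < (1 << 3)
        value = (flags << 21) | term_id
        return value.to_bytes(3, "big")

    def unpack24(raw):
        """Recover (term_id, flags) from a 3-byte record."""
        value = int.from_bytes(raw, "big")
        return value & ((1 << 21) - 1), value >> 21

    raw = pack24(123456, flags=5)
    print(len(raw), unpack24(raw))      # 3 bytes, round-trips exactly
    ```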

    Parsing the data

    Now that we have some idea what we need to do, we can get on with creating the parser to produce the documents to be indexed. The following script is by no means perfect, but it does a fairly good job of parsing out the files downloaded by the robot.
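    As a rough stand-in for the parser, here is a minimal Python sketch that combines the steps so far: strip tags, drop stop words, and emit one line per distinct term. The column layout it prints (term, document id, term id, frequency) is my reading of the sample output; the third and fourth columns are assumptions, and the stop list is a toy one:

    ```python
    import re
    from collections import Counter

    # Toy stop list and term-id table; a real parser would load both from files.
    STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for"}
    term_ids = {}

    def parse(doc_id, html):
        """Emit 'term doc_id term_id freq' lines for one downloaded page."""
        text = re.sub(r"<[^>]+>", " ", html)          # crude tag stripping
        terms = [w for w in re.findall(r"[a-z]+", text.lower())
                 if w not in STOP_WORDS]
        lines = []
        for term, freq in Counter(terms).items():
            term_ids.setdefault(term, len(term_ids) + 1)
            lines.append(f"{term} {doc_id} {term_ids[term]} {freq}")
        return lines

    for line in parse(1601662, "<p>The story of the story</p>"):
        print(line)
    ```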


    The script gives me a file that looks similar to the following, only with thousands of rows.

    spac 1601662 12961 2
    refer 1601662 50 1
    http 1601662 857 29
    newaccount 1601662 88917 1
    articl 1601662 201 8
    format 1601662 5 1
    current 1601662 136 1
    linuxcalendar 1601662 88918 1
    op 1601662 8917 5
    textad 1601662 88916 1
    authorguid 1601662 88919 1
    prnewswir 1601662 63569 1
    bin 1601662 2589 1
    stori 1601662 742 4
    acct 1601662 22574 1
    svbizink 1601662 63570 1
    edat 1601662 63571 1
    wed 1601662 1602 1
    sep 1601662 4636 1
    am 1601662 149 1
    target 1601662 661 1

    The first entry on each line is our stemmed word in English. The second is a string that represents the document ID. This is not a very efficient way of storing these entries, because we are duplicating the doc_id on every line. We could reduce the size of the output file considerably if we took our time and came up with a more space-efficient method. However, I am trying to get this whole thing finished in a few days or weeks, so I will leave the fancy binary storage etc. for you ;-).
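    One obvious space-saving direction, sketched here in Python, is to group the postings by term so each doc_id appears once per term rather than once per line; this is the classic inverted-index layout. The sample postings reuse ids from the output above plus a made-up second document:

    ```python
    from collections import defaultdict

    # Flat "term doc_id freq" rows, as in the output file above
    # (the second document id is invented for the example).
    postings = [
        ("http", 1601662, 29), ("stori", 1601662, 4),
        ("http", 1601663, 3),  ("stori", 1601663, 1),
    ]

    # Group by term: each term maps to its list of (doc_id, freq) pairs,
    # so the term string itself is stored only once.
    index = defaultdict(list)
    for term, doc_id, freq in postings:
        index[term].append((doc_id, freq))

    for term, docs in index.items():
        print(term, docs)
    ```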

    I have not got any further yet, but if you send me an email it might spur me on to finish it.

    I am working on this bit at the moment (12 Jan 2004).