
Roll Your Own Search Engine
by Brian Slesinsky

With all the hype about Web search engines, you might think writing one is a feat not to be attempted by mere mortals. But writing a small search engine is actually quite easy. Don't believe me? Well, in the space of this column, I'll show you how to add a search page to your own Web site with just a few lines of Perl. (For this to work, your site should be hosted on a computer running Unix, and you need the ability to install CGI scripts.)


Coming Up with an Algorithm

For the most part, a Web site is just a directory tree. If you didn't care about efficiency or response time, you could write a script using find and grep that would search your site the same way you would from the command line. But this brute-force method will bog down as your site grows, since each search has to read every single file on your site.
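
For comparison, here's a minimal sketch of that brute-force version, assuming your site lives under the current directory:

  #!/usr/bin/perl -w
  # brute.pl - grep every page on every search. Fine for tiny sites only.
  my $keyword = shift or die "usage: brute.pl keyword\n";
  open(FILES, "find . -name '*.html' -print|") or die "open for find: $!";
  while (my $filename = <FILES>) {
      chomp $filename;
      open(HTML, $filename) or next;
      my $html = join('', <HTML>);      # slurp the whole page
      close HTML;
      print "$filename\n" if $html =~ /\b\Q$keyword\E\b/i;
  }
  close FILES;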

It's much better to build what's known as an inverted index (also known as an inverted file). This is a table of words much like the index in the back of a book. For example, suppose we had a very simple Web site containing only two pages that look like this:

one.html:

   <html><head>
     <title>Doc One</title>
   </head><body>
     <p>This is document one.
   </body></html>

two.html:

   <html><head>
     <title>Doc Two</title>
   </head><body>
     <p>Here is another document.
   </body></html>

To index this Web site, we need to create two tables. First, we number each page and list its title and URL. This saves space by allowing us to use numbers to refer to the pages later on:

  1 => /one.html, "Doc One"
  2 => /two.html, "Doc Two"

Next, we create the inverted index itself by listing each word and the documents it appears in:

  another => 2
  doc => 1,2
  document => 1,2
  here => 2
  is => 1,2
  one => 1
  this => 1
  two => 2

To do a search, look in the inverted index for the word you want, then look up the Web page listed after that word. So, if you search for the word "here," our script will look up "here" in the inverted index, find "2," then look up "2" and print information about the document as a link:

<a href="/two.html">Doc Two</a>

Similarly, if you type two words, the script looks up each word in the inverted index and lists only the pages that appear under both.
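
To make this concrete, here's a throwaway sketch of that lookup in Perl, with the toy tables above written as ordinary hashes; the real CGI script later in this article uses the same counting strategy:

  # Search the toy index for pages containing both "doc" and "two".
  my %pages    = (1 => '<a href="/one.html">Doc One</a>',
                  2 => '<a href="/two.html">Doc Two</a>');
  my %inverted = (doc => '1,2', two => '2');

  my %counters;
  for my $word ('doc', 'two') {
      $counters{$_}++ for split /,/, ($inverted{$word} || '');
  }
  # A page that appears under every word has a counter equal to the
  # number of words -- here, 2. This prints only the "Doc Two" link.
  print "$pages{$_}\n" for grep { $counters{$_} == 2 } sort keys %counters;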

Deciding on a Data Structure

It wouldn't be hard to store the inverted index in an ordinary file and grep through it for the words we want. Since the index would be smaller than the site, this would be an improvement over using find and grep to search the whole site.

For large sites, however, even the index file can be huge, and reading the whole file on every search just to find a few words would be rather wasteful. So let's use a DBM file, which stores (name => value) records on disk and lets us fetch any record directly, without reading the rest. It so happens that both the table of Web pages and the inverted index consist entirely of (name => value) records, which makes it easy to map them to strings in the DBM file. In Perl's notation, our example index would now look like this:

  %dbm = (
    '-1' => '<a href="/one.html">Doc One</a>',
    '-2' => '<a href="/two.html">Doc Two</a>',

    'another' => '-2',
    'doc' => '-1-2',
    'document' => '-1-2',
    'here' => '-2',
    'is' => '-1-2',
    'one' => '-1',
    'this' => '-1',
    'two' => '-2'
  );

(I've switched to using negative numbers so any numbers that appear as words within Web pages won't conflict with the document index.)

Building the Crawler

Now we've got two scripts to write: the code that reads all the files on your Web site and builds the inverted index (the crawler), and the CGI script that looks up the words the user enters in the search form. Let's do the crawler first.


First, we open the DBM file where the inverted index will be stored. I'm going to use the Berkeley DB implementation because it's fast and allows records to be of arbitrary length. We need that capability because a common word such as "the" appears on nearly every page, so its entry in the index can grow without limit.

Open the index file like this:

  use DB_File;
  dbmopen(%db,"search_index.db", 0644) or die "dbmopen: $!";

The easiest way to find files on a Unix box is, of course, the Unix find command. In this case, we're using it to list all the .html files on the Web site:

  open(FILES, "find . -name '*.html' -print|") or die "open for find: $!";

We open each HTML file in turn and load its contents into a variable:

  my $filename;
  my $fileno = 0;      # document counter; pages get keys -1, -2, ...
  while(defined($filename = <FILES>)) {
      chomp $filename;                  # strip the newline find appends
      print "indexing $filename\n";
      open(HTML, $filename) or do { warn "open $filename: $!"; next; };
      my $html = join('', <HTML>);      # slurp the whole page into one string
      close HTML;

Then use a regular expression to extract the title and make an entry for this page in the table of Web pages:

      my ($title) = ($html =~ /<title>([^<]*)/i);   # the text inside <title>
      $title = $filename if(!defined $title);       # fall back to the filename
      $db{--$fileno} = "<a href=\"$filename\">$title</a>";

Now we need to make a list of all the words in the page. First we remove the HTML tags:

      $html =~ s/<[^>]+>//g;

If we want this to be a case-insensitive search, it will be easier if all the words are stored in the same case, so let's convert the document to lowercase:

      $html =~ tr/A-Z/a-z/;

Next, we'll make a list of all the words in the document:

      my @words = ($html =~ /\w+/g);

Finally, we'll append the words to the appropriate row in the inverted index, making sure that we don't index the same word twice:

      my $last = "";
      for (sort @words) {
          next if($_ eq $last);      # sorting puts duplicates side by side
          $last = $_;
          $db{$_} = defined $db{$_} ? $db{$_}.$fileno : $fileno;
      }

That's basically it. When you run the full script, shown below, on your Web site, it will create a file named "search_index.db" in the top-level directory of your site. This file will contain an index of all the words used on your site.
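
Assembled from the fragments above, a minimal working version of the crawler might look like this (run it from your site's top-level directory):

  #!/usr/bin/perl -w
  # index.pl - build search_index.db from every .html file under ".".
  use DB_File;

  dbmopen(%db, "search_index.db", 0644) or die "dbmopen: $!";
  open(FILES, "find . -name '*.html' -print|") or die "open for find: $!";

  my $fileno = 0;
  my $filename;
  while(defined($filename = <FILES>)) {
      chomp $filename;
      print "indexing $filename\n";
      open(HTML, $filename) or do { warn "open $filename: $!"; next; };
      my $html = join('', <HTML>);
      close HTML;

      # Record the page's title and URL under a negative document number.
      my ($title) = ($html =~ /<title>([^<]*)/i);
      $title = $filename if(!defined $title);
      $db{--$fileno} = "<a href=\"$filename\">$title</a>";

      # Strip tags, fold case, and split the text into words.
      $html =~ s/<[^>]+>//g;
      $html =~ tr/A-Z/a-z/;
      my @words = ($html =~ /\w+/g);

      # Append this document's number to each distinct word's entry.
      my $last = "";
      for (sort @words) {
          next if($_ eq $last);
          $last = $_;
          $db{$_} = defined $db{$_} ? $db{$_}.$fileno : $fileno;
      }
  }
  close FILES;
  dbmclose(%db);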

(Note: The script takes a long time to run and, depending on how many documents you have, will create a rather large file. In one test I did, the index file took 40 percent of the disk space used by the original files.)

Building the Search CGI

Now that we have an index, it's time to come up with a way for users to access it. I'm going to implement a simple search that finds only the pages that contain every word the user typed in the form.

The search form is simple enough:

  <form action="/search.cgi">
  <p><input name=s><input type=submit value="Search">
  </form>

Search.cgi reads the form variable and parses it into words:

  my $query = $ENV{'QUERY_STRING'};
  $query =~ s/^s=//;                   # strip the form variable's name
  $query =~ tr/+/ /;                   # '+' encodes a space in a query string
  $query =~ s/%[0-9a-fA-F]{2}/ /g;     # crude decoding: treat escapes as spaces
  my @words = ($query =~ /\w+/g);

Next, it opens the DBM file containing the inverted index:

  use DB_File;
  dbmopen(%db, "search_index.db", 0) or die "dbmopen: $!";

Our strategy for implementing the query is to keep a counter for each candidate document. We look up each word in turn and increment the counter of every document that contains it.

  my %counters;
  for my $word (@words) {
      my $pages = $db{lc $word};     # e.g. "-1-2": every page containing $word
      next unless defined $pages;    # skip words that aren't in the index
      for my $page ($pages =~ /(-\d+)/g) {
          $counters{$page}++;
      }
  }

A document that contains every word will have its counter incremented once for each word, so its count will equal the number of words. The following loop finds those documents and prints them:

  for my $page (sort keys %counters) {
      if($counters{$page} == scalar(@words)) {   # page matched every word
          my $href = $db{$page};
          print "$href<br>";
      }
  }

And that does it. The full script appears below. Of course, there are many options that could be added to this little search engine to make it friendlier, but they're just a matter of programming.
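
Here's a minimal sketch of search.cgi put together from the fragments above. It assumes the script runs in the directory that holds search_index.db, and it adds the Content-type header that every CGI response needs:

  #!/usr/bin/perl -w
  # search.cgi - look up the query words in search_index.db.
  use DB_File;

  # Pull the search terms out of the query string.
  my $query = $ENV{'QUERY_STRING'};
  $query =~ s/^s=//;
  $query =~ tr/+/ /;
  $query =~ s/%[0-9a-fA-F]{2}/ /g;
  my @words = ($query =~ /\w+/g);

  dbmopen(%db, "search_index.db", 0) or die "dbmopen: $!";

  # Count how many of the query words each document contains.
  my %counters;
  for my $word (@words) {
      my $pages = $db{lc $word};
      next unless defined $pages;
      for my $page ($pages =~ /(-\d+)/g) {
          $counters{$page}++;
      }
  }

  # Every CGI response must start with a header.
  print "Content-type: text/html\n\n";

  # Print links for the documents that matched every word.
  for my $page (sort keys %counters) {
      print "$db{$page}<br>\n" if $counters{$page} == scalar(@words);
  }
  dbmclose(%db);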



Brian Slesinsky is a former HotWired engineer. He left the company to work full-time on his death ray.
