
Building a Search Engine

Last updated 22 July 2014

Most individual site search engines in use today are fairly simple pieces of software. They either rely on the speed of a particular language to parse text, or they build indexes in a database of one sort or another. The sort of search engines I am interested in are on a much grander scale. Search engines like Google or Yahoo fascinate me, and I would dearly love to build a working model based on current research in this field.

Unfortunately the resources required to construct an internet search engine like Google are huge, so I will be scaling the operation down to a more manageable level. I have limited resources, so the searched set will be a fraction of what a full internet search engine would need. However, the work I have done so far indicates that with a decent desktop PC it is quite possible to implement a very efficient multiple-site search engine, and I have not started any optimisation yet.

This site is dedicated to the things I did in order to construct both the spiders and the search engine, and all the associated tools that go with them. Eventually I hope to provide a simple interface to my search engine, but that is quite far in the future.

I eventually decided to build the robot first and then the search engine, but on a smaller scale. The robot trawls the internet looking for links and stores them in a Postgres database. It is going to be a big database, and if you follow my progress you will see how I am getting on. Just follow the Blog Links on my personal website. Alternatively there is a Road Map for the project.
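To give a flavour of what the robot does on each page, here is a rough sketch of a single fetch-and-store step in Perl using LWP, HTML::LinkExtor and DBI. This is not one of my actual scripts: the database name and the links table with its url and found_on columns are made up for illustration, and a real spider also needs a crawl queue, politeness delays and robots.txt handling.

     #!/usr/bin/perl
     # Sketch of one fetch-extract-store step; table and column names are illustrative.
     use strict;
     use warnings;
     use LWP::UserAgent;
     use HTML::LinkExtor;
     use URI;
     use DBI;

     my $start = shift or die "usage: $0 <url>\n";

     # Connect to the Postgres database that will hold the links.
     my $dbh = DBI->connect('dbi:Pg:dbname=spider', 'spider', '',
                            { RaiseError => 1, AutoCommit => 1 });
     my $ins = $dbh->prepare('INSERT INTO links (url, found_on) VALUES (?, ?)');

     my $ua = LWP::UserAgent->new(agent => 'toy-spider/0.1', timeout => 20);

     # Fetch one page and collect every href from its anchor tags.
     my $resp = $ua->get($start);
     die 'fetch failed: ' . $resp->status_line . "\n" unless $resp->is_success;

     my @found;
     my $extor = HTML::LinkExtor->new(sub {
         my ($tag, %attr) = @_;
         push @found, $attr{href} if $tag eq 'a' && defined $attr{href};
     });
     $extor->parse($resp->decoded_content);

     # Resolve relative links against the page base and store them for later crawling.
     for my $href (@found) {
         my $abs = URI->new_abs($href, $resp->base)->canonical;
         next unless $abs->scheme && $abs->scheme =~ /^https?$/;
         $ins->execute("$abs", $start);
     }

     $dbh->disconnect;

In practice the insert would sit behind a de-duplication check (or a unique index on url) so the same link only ends up in the queue once.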

I would really like to see more people getting involved in the idea and perhaps turn it into a distributed search engine. This way we could start to scale up and index larger areas of the internet. The indexed data could then be uploaded into a master database where it can be searched as you would a normal search engine. It's a bit 'pie in the sky', but it sounds like fun to me. It would also be a good way for companies to index their own websites rather than me trawling over them.

It might not be everyone's cup of tea, but if you have experience with PostgreSQL, Perl and a little ETL and want to get involved, please drop me a line. You will need to be capable of working independently and you must be a good problem solver. At the moment all the tools are just a bunch of roughly hacked scripts and you would need to be able to sort through them yourself. I can give you pointers to get you up and running.

Hardware Updates

March 2004

I started to run out of disk space and had some trouble with the SCSI disk, so I have rebuilt the system as follows and it appears to be working a treat.

     COMPUTER
     1 Full Tower ATX
     1 Athlon 64
     1 MSI K8T Neo Motherboard

     RAM
     1Gb PC2700 DDR (I need more of this)

     DISK SPACE
     2 x 20Gb hard disks
     1 x 160Gb SATA
     I really need SCSI here because the IO load is quite high

     NETWORK
     600 Kbit/s link (runs closer to 522 Kbit/s)

June 2003

I have managed to get 1Gb of RAM into the machine. This made quite a difference to Postgres when I cranked the shared memory up a bit. Unfortunately, due to the SCSI disk failing on me, I have now run out of disk space to hold the data. I have a large (160Gb) SATA disk but no controller card at the moment, so as soon as I get one I will be starting again.
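For anyone wondering what "cranking the shared memory up" means in practice, it comes down to raising shared_buffers in postgresql.conf and making sure the kernel will grant Postgres a shared memory segment that big. The figures below are only illustrative, not the values I actually ran with; in the 7.x releases shared_buffers is given as a number of 8kB pages.

     # postgresql.conf (illustrative values only)
     # 32768 pages x 8kB is roughly 256Mb of shared buffers
     shared_buffers = 32768

     # The kernel's shared memory limit usually needs raising to match,
     # e.g. on Linux: sysctl -w kernel.shmmax=402653184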

For people interested in the details, the following is the hardware I started this project with.

     COMPUTER
     1 Full Tower ATX
     1 Athlon XP1700
     1 Elitegroup K7S5A Motherboard

     RAM
     512Mb PC133 SDRAM (upgraded to 1Gb DDR after a month)

     DISK SPACE
     3 x 20Gb hard disks
     1 x U160 18.4Gb SCSI

     NETWORK
     1 Mbit/s link (runs closer to 802 Kbit/s)