Before a search engine can tell you where a file or document is, that
file or document must first be found. To find information on the
hundreds of millions of Web pages that exist, a search engine employs
special software robots, called spiders, to build lists of the words
found on Web sites.

A search engine spider is an automated software program that locates
and collects data from web pages for inclusion in a search engine's
database, and follows links to find new pages on the World Wide Web.

When a spider is building its lists, the process is called Web
crawling. To build and maintain a useful list of words, a search
engine's spiders have to look at a lot of pages.

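
The crawling process described above can be illustrated with a small
Python sketch. It is only a toy under stated assumptions: the "web" is a
hypothetical in-memory dictionary named PAGES rather than real sites
fetched over the network, and the names PageParser and crawl are invented
for this example. It shows the two jobs a spider does: listing the words
on each page and following links to pages it has not seen yet.

```python
from collections import deque
from html.parser import HTMLParser
import re


class PageParser(HTMLParser):
    """Collects the words and the link targets found on one page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Remember the target of every <a href="..."> link.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Split the text between tags into lowercase words.
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


# Hypothetical stand-in for the Web: URL -> HTML source.
PAGES = {
    "/index.html": '<p>Welcome to the spider page</p><a href="/about.html">About</a>',
    "/about.html": '<p>About web crawling</p><a href="/index.html">Home</a>',
}


def crawl(start, pages):
    """Visit pages breadth-first from `start`; return URL -> list of words."""
    seen = set()
    queue = deque([start])
    word_lists = {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in pages:
            continue  # already visited, or a link that leads nowhere
        seen.add(url)
        parser = PageParser()
        parser.feed(pages[url])
        parser.close()
        word_lists[url] = parser.words
        queue.extend(parser.links)  # follow links to find new pages
    return word_lists


word_lists = crawl("/index.html", PAGES)
```

The `seen` set is what keeps the spider from looping forever when two
pages link to each other, as the two sample pages here do.
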
Crawler-based search engines have three
major elements. First is the spider, also called the crawler. The
spider visits a web page, reads it, and then follows links to other
pages within the site. This is what it means when someone refers to a
site being "spidered" or "crawled." The spider returns to the site on
a regular basis, such as every month or two, to look for changes.
Everything the spider finds goes into the second part of the search
engine, the index. The index, sometimes called the catalog, is like a
giant book containing a copy of every web page that the spider finds.
If a web page changes, then this book is updated with new information.
Sometimes it can take a while for new pages or changes that the spider
finds to be added to the index. Thus, a web page may have been "spidered"
but not yet "indexed." Until it is indexed -- added to the index -- it
is not available to those searching with the search engine.
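
The difference between "spidered" and "indexed" can be sketched in a few
lines of Python. This is a simplified model, not how any particular
search engine is built: the class name Index and its methods are
hypothetical, and the delay before indexing is modeled as an explicit
commit step.

```python
class Index:
    """Toy index: pages are searchable only after they are committed."""

    def __init__(self):
        self.pending = {}  # spidered, but not yet indexed
        self.catalog = {}  # indexed copies, visible to searchers

    def record_spidered(self, url, content):
        # The spider has read the page; it waits here until indexing runs.
        self.pending[url] = content

    def commit(self):
        # Indexing: new pages are added and changed pages are overwritten.
        self.catalog.update(self.pending)
        self.pending.clear()

    def is_searchable(self, url):
        return url in self.catalog


index = Index()
index.record_spidered("/news.html", "first version")
spidered_only = index.is_searchable("/news.html")  # False: not yet indexed
index.commit()
after_commit = index.is_searchable("/news.html")   # True: now indexed
```
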
Search engine software is the third part of a search engine. This is
the program that sifts through the millions of pages recorded in the
index to find matches to a search and rank them in order of what it
believes is most relevant.
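
As a rough illustration of this third part, the Python sketch below
searches a tiny index and ranks the matches. The scoring rule, counting
how often the query words appear on each page, is an assumption chosen
for simplicity; real search engines use far more elaborate relevance
measures. The sample pages and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query):
    """Return URLs of matching pages, highest word-count score first."""
    terms = tokenize(query)
    scored = []
    for url, text in index.items():
        words = tokenize(text)
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Sort by descending score; ties are broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical index contents: URL -> stored copy of the page text.
INDEX = {
    "/spiders.html": "Spiders crawl the web. A spider follows links.",
    "/index.html": "The index stores pages found by the spider.",
    "/cooking.html": "Recipes for soup and bread.",
}

results = search(INDEX, "spider links")
```

Here "/spiders.html" ranks first because it contains both query words,
"/index.html" follows with one, and "/cooking.html" is left out because
it matches neither.
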

Search engine spider identification
The following is a basic listing of
search engine spider names and their "owners". This is by no means
complete, as there are many thousands of search engines on the
Internet, but it covers the more common beneficial spiders.
Spider name             | Spider owner
------------------------|---------------------
Googlebot               | Google.com
TeomaAgent              | Teoma.com
Zyborg                  | Wisenut.com
Gulliver                | NorthernLight.com
Architext spider        | Excite.com
FAST-WebCrawler         | FAST (AllTheWeb.com)
Slurp                   | Inktomi.com
Yahoo Slurp             | Yahoo Web Search
Ask Jeeves              | AskJeeves.com
ia_archiver             | Alexa.com
Scooter                 | AltaVista.com
Mercator                | AltaVista.com
crawler@fast            | FAST (AllTheWeb.com)
Crawler                 | Crawler.de
InfoSeek sidewinder     | InfoSeek.com
Lycos_Spider_(T-Rex)    | Lycos.com
Fluffy the Spider       | SearchHippo.com
Ultraseek               | InfoSeek.com
MantraAgent             | LookSmart.com
Moget                   | Goo.jp
T-H-U-N-D-E-R-S-T-O-N-E | Thunderstone.com
MuscatFerret            | Euroferret.com
VoilaBot                | Voila.fr
Sleek Spider            | Search-info.com
KIT_Fireball            | FireBall.de
WebCrawler              | Webcrawler.com
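
One common use of a listing like this is spotting spider visits in a web
server log. The Python sketch below assumes that a spider announces its
name somewhere in the User-Agent string it sends with each request, and
looks for a known name inside that string. Only a few entries from the
listing are included, the function name identify_spider is hypothetical,
and the sample strings are illustrative rather than taken from a real
log.

```python
# A small subset of the listing above: spider name -> owner.
KNOWN_SPIDERS = {
    "Googlebot": "Google.com",
    "Zyborg": "Wisenut.com",
    "ia_archiver": "Alexa.com",
    "Scooter": "AltaVista.com",
    "FAST-WebCrawler": "FAST (AllTheWeb.com)",
}


def identify_spider(user_agent):
    """Return (name, owner) if a known spider name appears, else None."""
    lowered = user_agent.lower()
    for name, owner in KNOWN_SPIDERS.items():
        # Case-insensitive substring match against the User-Agent string.
        if name.lower() in lowered:
            return name, owner
    return None


spider = identify_spider("Googlebot/2.1 (+http://www.google.com/bot.html)")
visitor = identify_spider("Mozilla/5.0 (Windows NT 10.0)")
```

Because a User-Agent string is supplied by the visitor, a match only
shows what the visitor claims to be. Names that contain one another,
such as "Slurp" and "Yahoo Slurp", would also need the longer name
checked first, which this sketch avoids by leaving them out.
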