Crawling towards Eternity
Building An Archive of The World Wide Web
By Mike Burner
When you ask what he's up to, Brewster Kahle likes to surprise you. So when designing supercomputers became cliche and everybody was developing new technologies for full-text indexing Brewster decided to archive the Internet. He thought the Internet would fit nicely into a box. The box he had in mind was a tape robot that, if it held enough tapes, could hold terabytes of data. Into this box, Brewster wanted to cram all of the publicly accessible data from the World Wide Web, anonymous FTP sites, USENET news, and public gopher sites.
His vision was that this would become an archive of digital history, available forever as a research tool and time capsule. By visiting this "library," the world would be able to trace the development of technologies, styles, and cultural trends. Brewster knew it would be necessary to transcribe the archive on a continuing basis, lest the media deteriorate or the ability to read it disappear.
Thus was born the Internet Archive. Housed in the Little House on the Presidio in San Francisco, the Archive is actively collecting all the Web data that can be pulled down two T1 lines. This article describes Brewster and his archivists' experiences, and what they have learned from them.
Designing the Archive Crawler
Collecting Web data isn't a black art. You just need a program that can "speak" HTTP and parse the HTML to find links (in the form of URLs) to additional network objects. Such a "crawler" program can start almost anywhere on the Web, and will eventually find almost every public page in existence.
The catch is that to accomplish anything useful you have to keep track of all the retrieved objects. To archive the data, you must put it somewhere, and to do this efficiently you must crawl across many sites at once. If you have a conscience, this must be done without upsetting the millions of Web authors and administrators who have created and are supporting the data.
So, when the Archive set out to design its crawler, three things were accepted as givens:
- The proposed Standard for Robot Exclusion (SRE), which provides mechanisms for Web-site owners to control robot behavior on their sites, must be scrupulously obeyed.
- The design must permit splitting the crawl across several machines, possibly at different geographic locations.
- The objects retrieved must be "bundled" into large files, since the tape robot's filesystem cannot manage hundreds of millions of small objects.
Beyond these constraints, the Archive had several concerns:
- The crawler should be polite, so as not to unduly burden a Web site when it visits.
- The crawling software must use the hardware resources efficiently, to optimize available bandwidth.
- The storage strategy should support the anticipated retrieval needs.
Driven by these constraints and concerns, a common theme emerged in design discussions: how "expensive" it was to hop from site to site. If, for example, there were a huge list of unsorted URLs, and the crawler just grabbed one as needed, it would have to parse the URL to find the host and port, look up the IP address of the host, and fetch the robot exclusions for the site, all before requesting the object that the URL referenced. It would be difficult to prevent the same site being visited simultaneously by multiple crawlers and to retrieve the large number of "bundles" containing all of the objects for a given site.
The Archive concluded that it would be easiest to crawl on a site-by-site basis. Thus, the following design emerged:
- The list of sites would be built by collecting "external" references from Web pages (those references pointing off-site).
- A single process would be assigned a queue of sites, and would crawl a number of them (currently 64) at once.
- Each process would loop through sites using nonblocking I/O in a "select loop" (even though the crawler actually uses the Solaris "poll" system call).
- The crawler would let sites "rest" between object retrievals, so as not to overload any one site.
The advantages of this approach were manifold, and serendipitous ones emerged as the crawler developed. With this design, IP addresses and robot exclusions could be held in memory for each site being crawled, multiple crawling machines would never collide on the same site, and the data for a single site would all be contained in a small number of bundles (often just one).
How it Works
A typical Archive-crawler visit to a Web site begins when the crawling process fetches a site name and IP address, and the port number from the site queue; see
Figure 1. The crawler then opens (or creates) the "crawl queue" for the site, which keeps track of the URL paths on the site that have been, or need to be, retrieved. If the queue does not already exist, the path referring to the root of the site ("/") is put in the queue.
The next step is to retrieve and parse robots.txt, which may contain exclusions indicating paths the crawler should not pursue. Those exclusions are stored in memory for later use. One by one, the crawler fetches references to objects from the crawl queue. The path is checked against any exclusions for the site what's not excluded is fetched. If the object has a content type of "text/html," the crawler parses the contents of the object looking for references to other documents. When one is found, the path is normalized. Relative paths are made absolute, and hostnames are made all lower case. If a reference such as <
img src="http://www.newarchitectmag.com/documents/s=6600/new1013637947/../images/chevette.gif"> were found in the document /chevrolet/classics/chevette.html, the path of the new URL would be normalized to /chevrolet/images/chevette.gif. If the reference is to an object on the same site, the crawler checks to see if it is new. If the reference is new it is added to the crawl queue, which is updated on disk so that the crawl can resume where it left off if it is interrupted.
Fetched objects are written into the large "bundle" files, preceded by a line describing the object that includes the URL, the object size, and the date it was retrieved. You search for an object in a bundle file by reading the description line, then the object (because the size is in the description), then the next description, and so on until the desired object is found. This is tedious, however; an index is obviously necessary.
The Archive indexes Web content differently from search engines such as AltaVista. Those systems create keyword or full-text indexes of the Web's textual content to permit queries such as
find me all Web pages with the phrase "little red corvette" in them.
The Archive is building more of a card catalog, to support queries such as:
Tell me the modification times and locations in the Archive of all objects with the URL http://www.automobile.org/features/safetybelts.html.
For the search services, indexing strategies are core business technologies that directly affect how quickly they keep up-to-date with the Web. For the Archive, indexing is merely a major headache.
Assuming you have twenty gigabytes of disk space, it is pretty easy to create a table of 100 million URLs and another of 200 million object retrievals. Unfortunately, it takes a long time to find anything in such huge tables, unless you can afford a machine with 8 GB of memory to hold the table indexes in RAM. So the Archive has split the data into about a thousand smaller tables, numbered with a hashed value of the server name. While this has made queries tolerable, in the long run that 8GB machine will probably be needed.
The URL Problem
Once the Archive had a crawler design and a cataloging strategy, "all" that was left was programming. If you polled the few dozen engineers who have actually implemented an ambitious Web-crawling effort, they'd probably cite dealing with the domain-name service and answering the question "Do I know about this URL?" as the most annoying stumbling blocks.
There are at least 100 million unique URLs on the Web. They include HTML documents, images, sound files, movies, applications, and a host of less common file types. As the crawler locates URLs in HTML documents, it must decide whether it has already retrieved the referenced object, or whether this is a new link. The problem is not even as "simple" as comparing it to the 100 million known links, since the same machine may have multiple names. The URLs http://www.automobile.org/vw/bug.html and http://home.automobile.org/vw/bug.html may be two names for the same object.
The straightforward approach would be to load up a relational database with all URLs including all aliases and just query against it when a URL is located in an HTML document. But when fetching ten pages per second, with an average of 18 links per page (including href, src, background, url, code, and lowsrc), standard database engines just could not keep up.
The Archive's solution is a giant bitmap, renowned as the Swiss army knife of high-performance programming; see
Figure 2. The trick is to allocate a chunk of memory, zero it, and set bits to keep track of what you have already seen. To choose the bits, the crawler computes ten hash values of the URL string, using ten different hashing algorithms (well, actually, it uses one algorithm and ten different tables of ASCII-integer mappings). So when an URL is located, its hash array is computed, the appropriate bits are checked in the bitmap, and if all ten are already set, the URL is deemed "already found" (though not necessarily "already visited"). If any of the bits are not set, the URL is new; it is added to the queue and its bits are set in the bitmap.
This approach is very fast, but may render "false positives" (identifying an unseen URL as already seen). Statistically, a bitmap with two bytes for each string being tracked is pretty safe. One has to wonder, though, if "index.html" hashes to the same values as some other common filename.
As fast as the giant bitmap is, it is still difficult to work with the whole Web. Two bytes times 100 million is a lot of memory. Moreover, it is nearly impossible for multiple crawling machines to synchronize their bitmaps in real time. This is one of the problems serendipitously solved by the Archive's decision to crawl by site. Because most links in an HTML document are "local" to the site, most of the "Have we seen it?" questions can be answered by checking a bitmap specific to the site being crawled. This bitmap can be much smaller (usually 256K bits), and need not be synchronized with other bitmaps.
External references are simply written to disk and batch-processed by a separate program. That program is the only one that needs to wield a 2GB bitmap, and that bitmap can be saved to disk when not in use.
The domain name service (DNS) is a clever, distributed database that allows the machines on the Internet to recognize each other in the absence of a central naming authority. A few top-level servers can tell you what machine to contact for information about machines in a given domain. A "domain" designates an organization, such as harvard.edu or microsoft.com; there are currently more than 800,000 registered domains in the United States, and many more abroad. For a domain to work properly, it must have a "nameserver," a machine that knows the names and addresses of all of the machines in the domain.
So to get to cus.cam.ac.uk, a Web browser would first contact one of the top-level servers, which would redirect it to a server in the United Kingdom, which in turn would likely redirect it to a machine at Cambridge University that would provide the IP address for cus.cam.ac.uk. This name-resolution scheme works fine for casual browsing, which only requires a name lookup every few minutes, but it becomes a serious bottleneck for high-speed, widely distributed data collection.
Another problem with DNS is the issue of uniqueness. It is common for a single machine to have several names; this creates intuitively named aliases for machines that provide common services (such as "www," "mail," or "news") and is a simple means for creating "virtual domains." Clearly, it would be undesirable to crawl the same machine multiple times, just because it had multiple names within DNS. Worse yet, multiple instances of the crawler might simultaneously access the same machine, causing undue load on the site.
To avoid crawling the same data multiple times, the crawler must treat all names that resolve to the same IP address as the same host. So the IP address of a newly resolved hostname must be compared to all other IP addresses in the crawler's site queue to determine if the address is new.
Unlike most crawling groups, the Archive has a third problem with DNS: tracking its history. The global namespace is in constant flux. The name www.automobile.org may point to model-t.automobile.org today, hudson.automobile.org next month, and ferrari.automobile.org next year. The Archive needs to keep track of these mappings, so that when a user asks to see www.automobile.org from December of 1996, the Archive will know that means the data fetched from model-t.
Therefore, the Archive created a database to map DNS over time. Each record has a value, a type, an entry date, a superseded date, and an ID number. The value may hold either a name or an address and the type may be canonical, address, or alias. The "canonical" name of the machine is what the Archive actually calls it; it is always a canonical name within DNS, but is arbitrarily chosen if a given machine has multiple canonical DNS names. An address is the IP address that the Archive uses; if a machine has multiple IP addresses, the "extras" will be stored as aliases. An alias is a DNS alias ("www" is an alias about half of the time), an extra canonical name, or an extra address. The entry date is when the information was first looked up. The superseded date is when it became invalid; a current entry has an empty (NULL) superseded date. All records referring to the same canonical record share an ID number.
Figure 3 shows how to retrieve the current address for www.automobile.org in SQL. In English, that would be:
get me the address that hasn't expired for the host that is currently called www.automobile.org
The DNS map is maintained separately from the crawl, so the crawling software never has to wait for name resolution to retrieve an object. Occasionally, the name-address pairs will become obsolete, but this is rare enough that the cost of revisiting is far lower than the cost of resolving the name more frequently.
Despite the trials of implementation, the tribulations of adequate network connectivity, and the travails of maintaining tape robots driven constantly at capacity, the Archive is being built. The crawler has essentially been running nonstop since October 1996, and as of the end of February 1997, the Archive has about 2.5 terabytes of data under management.
What will become of all of the data the Archive is collecting? The answer will depend on how copyright laws are interpreted as they relate to digital media. Perhaps it will be a decade or more before the time capsule can be opened for all to see, or perhaps an acceptable means of electronically "permitting" the publication will emerge much sooner.
Those who favor the latter see the Archive offering two essential services: a citation service for the Web and a solution to "link rot." Who has not been referred to a page by a publication, only to find that the content has changed? And most of us do not go a week without encountering a "404" (File Not Found) when following a link from our favorite search engine, because the referenced page is no longer there.
What if Web Techniques could write "for an early example of the use of frames, see archive:19951012/www.webtechniques .com/frames.html," and be confident that its readers, even years hence, would see what they were meant to see? Or, what if you could actually find that page at http://www.irs.ustreas.gov that explained the 1995 self-employment tax rules, even though you were being audited in 1998?
Most people believe the Archive is a Good Thing, but it is a monumental task, and they are happy to leave it to Brewster Kahle and his cohorts at the Internet Archive. (Also see "The Truth About the Web".)
Mike is a systems architect at the Internet Archive and wrote the Archive's Web-crawling software. Contact him at email@example.com.