The Truth About the Web


By Z Smith

We've been crawling the "public" Web: the sites you don't need a password or payment to view. We began building our crawler in August 1996; it became operational on October 1st, and by October 18th it was crawling at 440 MB/hr over our T-1 line. In January we added a second T-1 line. When gathering just HTML, our crawler collects about 4 million pages per day, comparable to Scooter, the famed AltaVista crawler.
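
A quick back-of-envelope check in Python (an illustration of how these figures hang together, not part of the crawler) shows the HTML-only crawl rate is consistent with two T-1 lines; it assumes the 5 KB typical page size reported later in this article:

    # A T-1 line carries 1.544 Mbit/s; convert to megabytes per hour.
    T1_MB_PER_HOUR = 1.544e6 / 8 * 3600 / 1e6          # ~695 MB/hr per line

    PAGES_PER_DAY = 4_000_000                           # HTML-only crawl rate
    TYPICAL_PAGE_BYTES = 5 * 1024                       # 5 KB typical HTML page

    crawl_mb_per_hour = PAGES_PER_DAY * TYPICAL_PAGE_BYTES / 24 / 1e6
    print(f"Implied crawl rate: {crawl_mb_per_hour:.0f} MB/hr")  # ~853 MB/hr
    print(f"One T-1 line:       {T1_MB_PER_HOUR:.0f} MB/hr")     # ~695 MB/hr
    # More than a single T-1 can carry, but well within the two
    # lines in place by January.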

While one could start at any well-connected site (say, Yahoo!) and just follow the links, we had some data that gave us a head start: donated text crawls from two text search-engine companies and a university research project. We also examined the lists of URLs served by a major third-level cache (18 million Web-object requests).

Adding these sources together, we were able to build a master list of sites with well over one million site names. We then used DNS to find out how many of these names were actually valid sites and how many were aliases.
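
The DNS step looked roughly like the following Python sketch (a simplified illustration, not the Archive's actual tooling): resolve each candidate name and group names that share an address, which is one rough way to spot aliases.

    import socket
    from collections import defaultdict

    # Hypothetical candidate names; the real master list held over a million.
    candidate_names = ["www.example.com", "example.com", "www.example.org"]

    by_address = defaultdict(list)
    invalid = []
    for name in candidate_names:
        try:
            by_address[socket.gethostbyname(name)].append(name)
        except socket.gaierror:
            invalid.append(name)        # name does not resolve: not a valid site

    for addr, names in by_address.items():
        if len(names) > 1:
            print(addr, names, "(likely aliases for one host)")
    print(len(invalid), "names failed to resolve")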

Following is a list of interesting facts about the Web, based mostly on data gathered by the Internet Archive and augmented with some statistics from Larry Page of Stanford University and public documents from the Web.

How many Web sites are there?

  • One million Web-site names are in common usage.
  • There are about 450,000 unique host machines.
  • If you request the top page from these 450,000 hosts, about 300,000 will return one within a reasonable time (a minimal probe of this sort is sketched after this list). The rest appear to be intermittent or archaic.
  • About 95 percent of the 300,000 servers are "up" at any given time.
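
Here is a minimal sketch of the kind of liveness probe described above, written in present-day Python purely for illustration (not the crawler's own code): request a host's top page and see whether anything comes back within a short timeout.

    import socket
    import urllib.error
    import urllib.request

    def top_page_responds(host, timeout=10):
        """True if the host serves its top page within `timeout` seconds."""
        try:
            with urllib.request.urlopen(f"http://{host}/", timeout=timeout):
                return True
        except (urllib.error.URLError, socket.timeout, ValueError):
            return False

    print(top_page_responds("example.com"))     # hypothetical host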

How big is the Web?

We estimate there are 80 million HTML pages on the public Web as of January 1997. The figure is fuzzy because some sites are entirely dynamic (a database generates pages in response to clicks or queries). The typical Web page has 15 links (HREFs) to other pages or objects and five sourced objects (SRC), such as sounds or images. Moreover:

  • The typical HTML page is 5 KB.
  • The typical image (GIF or JPEG) is 12 KB.
  • The average object served via HTTP is 15 KB.
  • The typical Web site is about 20 percent HTML text, 80 percent images, sounds, and executables (by size in bytes).

The upshot of this data is that it takes about 400 GB to store the text of a snapshot of the public Web and about 2000 GB (2 TB) to store the non-text files.
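
Spelled out (an illustration built from the figures above, not a new measurement):

    PAGES = 80_000_000
    TYPICAL_PAGE_BYTES = 5 * 1024

    text_bytes = PAGES * TYPICAL_PAGE_BYTES              # ~410 GB of HTML text
    # HTML is roughly 20 percent of a site by bytes, so non-text
    # content is about four times the text.
    nontext_bytes = text_bytes * (0.80 / 0.20)            # ~1.6 TB; call it 2 TB

    print(f"Text snapshot:     {text_bytes / 1e9:.0f} GB")
    print(f"Non-text snapshot: {nontext_bytes / 1e12:.1f} TB")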

How big are individual Web sites?

  • The median size for a Web site is about 300 pages; only 50 sites have more than 30,000 pages.
  • About 5 percent of all servers have a robots.txt file (governing how crawlers visit a site; see the sketch after this list).
  • About 1 percent of all servers have a sitelist.txt file (to aid site mapping and robot revisiting).
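
The sketch below shows how a crawler honors a site's robots.txt file. It uses Python's standard urllib.robotparser as a modern convenience for illustration; the site and user-agent names are hypothetical.

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")     # hypothetical site
    rp.read()                                       # fetch and parse the file

    # May a crawler identifying itself as "ExampleCrawler" fetch this page?
    print(rp.can_fetch("ExampleCrawler", "http://example.com/private/page.html"))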

How fast is the Web growing?

  • The size of the Web is doubling yearly, but this statistic is losing its meaning because of the growth of dynamic sites.
  • The typical Web page is only about two months old.
  • Dynamic sites are becoming a significant presence; JavaScript is widespread, while Java is much less common but growing.

How do surfers use the Web?

  • The typical user downloads around 70 KB of data for each HTML page visited (see the cross-check after this list).
  • The typical user visits 20 Web pages per day.
  • One percent of all user requests result in "404, File Not Found" responses.
  • The 1000 most popular sites (out of 300,000) account for about half of all traffic.
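
As an illustrative cross-check (derived from the figures in this article, not measured separately), the 70 KB downloaded per page visited is consistent with the earlier per-object sizes:

    HTML_KB = 5            # typical HTML page
    SRC_OBJECTS = 5        # sourced images/sounds per page
    OBJECT_KB = 12         # typical image size

    page_weight_kb = HTML_KB + SRC_OBJECTS * OBJECT_KB    # 65 KB, close to 70 KB
    daily_kb_per_user = 70 * 20                           # ~1.4 MB per user per day
    print(page_weight_kb, daily_kb_per_user)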

Z is vice president of engineering for Internet Archive. He can be contacted at zsmith@archive.org.




