The Internet Archive: Building an 'Internet Library'

In the Collections

The Internet Archive’s collections include World Wide Web pages, FTP sites, and Usenet bulletin boards. The Web collection is open to researchers, historians, and scholars.

SEE THE SMITHSONIAN'S DISPLAY OF 1996 PRESIDENTIAL ELECTION SITES

The pages below are from an Internet Archive snapshot of the Web captured in the months before the 1996 presidential elections.
[Image: Page from a snapshot of the Web, now in the Smithsonian]

Who's Using the Web Collection

The Library of Congress: Lobby Sculpture
Smithsonian: 1996 Elections Display
Xerox PARC: Research Projects
IBM: Research Projects
AT&T: Research Projects
NEC: Research Projects
US Federal Government Information Clearinghouse: Global Information Locator Service

Read more about the Web collection
Find out how to get access to the collection
Read about ideas for other ways to use Internet libraries

World Wide Web Pages in the Archive

DATES:
October 1996 to now
SIZE:
13.8 terabytes (about 1 billion pages, text only during 1999; see the back-of-the-envelope arithmetic below)
RATE OF GROWTH:
About 2 terabytes a month as of March 2000
ACCESSIBLE:
From late 1998 to six or more months ago (the collection contains no material less than six months old), or about 3 terabytes as of March 2000. We hope to make the rest of the material (collected from late 1996 to late 1998) available during 2000
ACCESS:
See the Archive’s Terms of Use
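
A rough reading of the figures above (our own back-of-the-envelope arithmetic, not Archive statistics):

    # Back-of-the-envelope arithmetic from the published collection figures.
    TB = 1e12  # bytes per terabyte (decimal convention)

    collection_bytes = 13.8 * TB       # Web collection size, text only
    page_count = 1_000_000_000         # about 1 billion pages
    growth_per_month = 2 * TB          # rate of growth as of March 2000

    avg_page = collection_bytes / page_count
    print(f"Average page of text: about {avg_page / 1000:.1f} KB")  # ~13.8 KB

    # At 2 terabytes a month, the collection doubles in about 7 months.
    print(f"Months to double: {collection_bytes / growth_per_month:.1f}")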

Here’s how some organizations are using the Web collection.

Library of Congress: Lobby Sculpture

"World Wide Web 1997: 2 Terabytes in 63 Inches"
Alan Rath, 1997
Software by Art Medlar
Aluminum, computer, electronics, digital tape
Library of Congress, Washington, DC
Hardware and software gift of the Internet Archive
Data gift of Alexa Internet

If the whole World Wide Web were gathered into one real-world location, what would it look like? Visitors to the Library of Congress find out as they interact with a sculpture in the lobby — a stack of bright red computer monitors and storage tapes housing a snapshot of the Web from early 1997. The sculpture displays over a thousand sites, which flash by on the screens at the rate of several sites a second.

WATCH A DEMO OF THE SCULPTURE
(if you have an ISDN or faster connection)
[Image: Sculpture of the 1997 Web snapshot in the lobby of the Library of Congress]

Smithsonian Institution: 1996 Presidential Elections Display

A display at the Smithsonian Institution shows how presidential candidates and parties first used the Web. The display includes 1996 campaign pages for five political parties — as well as pages such as the "Steve Forbes Official Home Page" and the "Official Internet Headquarters of the [Pat] Buchanan Brigade," which were captured before some candidates dropped out of the race and scaled back or shut down their sites.

The display also includes pages from the Federal Election Commission site with financial information about candidates, parties, and political action committees.

SEE THE SMITHSONIAN'S DISPLAY
or SEE MORE '96 ELECTION SITES
including voter advocacy and news sites, more candidates' sites, and parodies of the candidates
[Image: Page from a snapshot of the Web, now in the Smithsonian]

Press Release from the Smithsonian Institution
"National Museum of American History Tracks Presidential Election Process From a Web Perspective"
7 March 1996

"Internet in a Box"
Spencer Reiss, Wired, October 1996, page 72

Xerox PARC: Research Projects

"It Grows on Its Own Like an Ecosystem"

The Internet Ecologies Area at Xerox’s Palo Alto Research Center is using multiple snapshots from the Internet Archive on disk — "the Web in a box" — as a kind of test tube for understanding the Web. "We see the Web as an ‘information ecology,’ where we study the relationships between people and information," says PARC researcher Jim Pitkow.

PARC "benefited greatly" from access to the Archive’s crawls, says Pitkow’s colleague and Stanford physics professor Bernardo Huberman. According to Pitkow, access to the snapshots "is great for researchers because it lets them fuse traditional tools and techniques with new tools that haven’t existed before."

Huberman describes a PARC study that produced a mathematical "law of surfing," which says that Web traffic follows predictable, regular patterns. For example, in a manifestation of the "winner take all" principle, it turns out that just a few Web sites get most of the traffic. The researchers were also able to show how deeply people delve into a typical Web site: on average, it’s about a page and a half. Huberman has also studied Internet congestion as a social dilemma, where people weigh the costs and benefits of putting up with slow traffic versus waiting until the network is less crowded.
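
The "winner take all" pattern is easy to picture with a toy calculation. The sketch below assumes a Zipf-like popularity distribution; the exponent and site count are illustrative choices, not figures from the PARC study:

    # Toy "winner take all" illustration: if site popularity follows a
    # Zipf-like law, a tiny fraction of sites captures most of the traffic.
    # The exponent (1.0) and site count are assumptions for illustration.
    import numpy as np

    n_sites = 100_000
    ranks = np.arange(1, n_sites + 1)
    shares = 1.0 / ranks          # Zipf's law with exponent 1
    shares /= shares.sum()        # normalize to shares of all visits

    top_1_percent = shares[: n_sites // 100].sum()
    print(f"Traffic share of the top 1% of sites: {top_1_percent:.0%}")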

In a study of the topology of the Web, a Stanford graduate student working on PARC’s Internet ecology project found that any two Web sites are no more than four clicks away from each other — hard evidence that the world is smaller than it seems, on the Web at least.
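
On archived crawl data, a claim like "four clicks apart" reduces to shortest-path search over the site-level link graph. A minimal sketch, with a hypothetical graph standing in for real crawl output:

    # Breadth-first search over a site-level link graph; the number of
    # levels expanded before reaching the target is the click distance.
    from collections import deque

    # Hypothetical link graph; real input would come from a crawl.
    links = {
        "a.com": ["b.com", "c.com"],
        "b.com": ["d.com"],
        "c.com": ["d.com", "e.com"],
        "d.com": ["f.com"],
        "e.com": [],
        "f.com": [],
    }

    def clicks_between(start, target):
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            site, dist = queue.popleft()
            if site == target:
                return dist
            for nxt in links.get(site, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return None  # unreachable

    print(clicks_between("a.com", "f.com"))  # 3 clicks, via b.com or c.com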

Research on this scale and of this complexity makes new thinking possible in a whole range of fields, from graph theory to sociology. Pitkow compares what’s happening to the Einstein-era thrust past the limitations of Newtonian physics into quantum mechanics: "The Web," he says, "requires a whole new form of understanding."

News coverage and further information:

Xerox PARC Internet Ecologies Area
http://www.parc.xerox.com/iea

"Shall I Compare Thee to a Swarm of Insects? Searching for the Essence of the World Wide Web"
George Johnson, New York Times, 11 April 1999
(Use keywords in the title to search for the article. You must be an online subscriber and pay a fee to read the article.)

"Sociologie du cybermonde: Une équipe du Palo Alto Research Center s'est lancée dans une étude des lois comportementales sur l'Internet" (in French)
Patrick Sabatier, Libération, 27 July 1999

"Le web se ‘balkanise.’ Seul 1% des sites draine 55% des internautes!" (in French)
Pierre-Philippe Cadert, Webdo, August 1999

IBM: Research Projects

"The La Brea Tar Pits of Our Age"

Inside the building where high-performance, large-capacity storage disks were invented, researchers at IBM's Almaden Research Center are developing software that deals intelligently with large masses of data. Using a "crawl," or snapshot of the Web, from the Internet Archive, they've developed successors to Intelligent Data Miner, a program that sorts and indexes large amounts of raw data.

The software is useful for mundane tasks like properly routing email to sales, tech support, and other departments. But IBM research associate Bruce Baumgart and his colleagues have also used it — along with a large body of data like the crawl from the Internet Archive — to find out how Web sites point to one another and form communities of common interest. Baumgart says that "unleashing" the IDM on the Archive's data reveals "clusters of activity.... You see what was hot, what the breaking story was, say, on a given date two years ago."
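
The article doesn't spell out IBM's method, but a minimal stand-in for "sites that point to one another form communities" is co-citation counting over crawl links. The sketch below is purely illustrative, not the Intelligent Data Miner's actual algorithm:

    # Co-citation: sites frequently linked from the same pages tend to
    # belong to the same community. Illustrative only; not IBM's IDM.
    from collections import Counter
    from itertools import combinations

    # Hypothetical crawl output: page -> outbound site links.
    outlinks = {
        "news1.com/story": ["candidate-a.org", "candidate-b.org"],
        "news2.com/story": ["candidate-a.org", "candidate-b.org"],
        "blog.com/post":   ["candidate-a.org", "comics.com"],
    }

    cocitation = Counter()
    for targets in outlinks.values():
        for pair in combinations(sorted(set(targets)), 2):
            cocitation[pair] += 1

    # The most co-cited pairs suggest "clusters of activity."
    for pair, count in cocitation.most_common(2):
        print(pair, count)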

Baumgart compares the Archive with the La Brea tar pits — a large deposit of pitch in the middle of Los Angeles, where paleontologists make important discoveries as they dig up the fossils of creatures and plants that fell into the pits during the Ice Age. As people begin to understand how projects like the Internet Archive benefit communities, Baumgart believes that demand for lasting storage media will grow — in contrast to the current market for media that last only a few years before the stored data begins to degrade.

AT&T: Research Projects

Dilbert Versus Doonesbury

When it comes to research, AT&T researcher Balachander Krishnamurthy sees plenty of advantages in a library like the Internet Archive.

First, the Archive eliminates the need for researchers to develop their own "crawlers" (software that search engines and others use to gather Web pages). This saves researchers time and expense, and without what Krishnamurthy refers to as the "mental stumbling block" of development, they can test new ideas quickly.

For example, using a ready-made Internet library like the Archive, a researcher interested in the popularity of dilbert.com compared to doonesbury.com could analyze links to those sites in minutes. (Search engines like Google use similar methods to return high-quality search results.) By contrast, a researcher could spend weeks just developing a spider and crawling enough Web servers to get adequate data. If the researcher wanted to compare crawls over time — for example, to look at public use of government information by comparing the most-linked-to dot-gov sites over the course of an election campaign — weeks would turn into months.
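
With a crawl already on disk, that comparison is little more than a scan over link records. A sketch, assuming a hypothetical file of "source-page target-url" pairs with absolute URLs:

    # Count inbound links to two sites in a crawl stored as one
    # "source-page target-url" pair per line (format is assumed).
    from collections import Counter
    from urllib.parse import urlparse

    inlinks = Counter()
    with open("crawl-links.txt") as f:      # hypothetical file name
        for line in f:
            _source, target = line.split()
            inlinks[urlparse(target).netloc] += 1

    print("dilbert.com:", inlinks["dilbert.com"])
    print("doonesbury.com:", inlinks["doonesbury.com"])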

Besides, "the algorithms for good crawling aren't published, so you're better off getting someone else's crawl if you just want access to data," says Krishnamurthy, adding that "the Archive has high-quality crawls."

Krishnamurthy emphasizes that using the Archive's collections results in other efficiencies too. Not only does easy, central access to Web data spare trouble, expense, and time for researchers — but "when fewer crawlers are at work on the Web," he says, "it reduces the load on individual site servers and on the Internet in general."

Furthermore, searches of the nonprofit Archive's collections are anonymous, whereas other searches may not be. For example, says Krishnamurthy, "if someone did a patent search on a site hosted by a private corporation, they could potentially reveal valuable information about themselves and the topics they were interested in, maybe to a competitor."

One of the Archive's most important benefits is a traditional one. The Archive's open access lets researchers engage in a fundamental scientific practice: replicating colleagues' experiments and performing valid, publishable comparisons of the results. And in an open, worldwide environment like the Internet, where virtually anyone can build new software, the Archive is a practical environment for testing whether the software conforms to Internet standards. "Suppose someone had an idea for a different compression algorithm for hypertext transfer protocol," Krishnamurthy suggests. "Is it worth it to develop it or not?" An analysis using the Archive's Web collection would provide a quick answer.
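
The quick answer might come from a script like the following, which measures what a compressor saves on a sample of archived pages. Here zlib stands in for the candidate algorithm, and the sample directory is hypothetical:

    # Estimate the savings of an HTTP compression scheme by running it
    # over a sample of archived HTML pages. zlib stands in for the
    # candidate algorithm; the sample directory is hypothetical.
    import zlib
    from pathlib import Path

    raw = packed = 0
    for page in Path("archive-sample").glob("*.html"):
        data = page.read_bytes()
        raw += len(data)
        packed += len(zlib.compress(data, level=6))

    print(f"Compression ratio over sample: {raw / packed:.2f}x")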

NEC: Research Projects

The Internet Archive: Better Than a Search Engine

What kinds of information do people and organizations put on the Web? How are different kinds of information distributed, and how accessible are they? NEC researchers Dr. Steve Lawrence, Dr. Lee Giles, and Dr. Gary Flake are using the Internet Archive's Web collection to study the social, political, economic, and scientific implications of this important new medium. Lawrence calls the Internet Archive "an invaluable resource that helps us to understand the Internet and maximize the benefits of the information age for society."

Among their discoveries so far: of all Web servers, 83 percent are used for commercial purposes and 6 percent for scientific or educational purposes, while pornography accounts for only 1.5 percent.

The researchers also found that the coverage of Web search engines is significantly limited, as is the speed with which they index new or modified pages. Investigating the question of access to information, they found that search engines are more likely to index popular pages. According to Lawrence, these findings highlight the importance of the Internet Archive: not only does the Archive preserve the Web, it reflects the Web more accurately than search engines do.

Lawrence and his colleagues are currently using the Internet Archive to experiment with a new, efficient algorithm for identifying communities on the Web. Given a sample of Web pages within a community, the algorithm exploits the link structure in the Archive's Web collection to find all sites within the same community. "Using the Archive's collection, we can investigate algorithms that would be impossible to investigate on the Web itself [in its distributed network context]," says Flake. "That kind of resource will be key to designing the next generation of Web search engines."
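
A published formulation by these researchers casts community identification as a maximum-flow problem; the toy sketch below follows that framing, with an invented link graph and seed sites:

    # Community finding as a minimum cut: tie a virtual source to known
    # seed pages, a virtual sink to the "outside" Web, and cut between
    # them. Graph and seeds are invented for illustration.
    import networkx as nx

    G = nx.DiGraph()
    G.add_edges_from([                       # tightly interlinked community
        ("seed1", "seed2"), ("seed2", "seed1"), ("seed1", "a"),
        ("a", "seed2"), ("a", "b"), ("b", "seed1"),
        ("b", "x"), ("x", "y"), ("y", "z"),  # sparse links to the wider Web
    ], capacity=1)

    G.add_edge("SOURCE", "seed1", capacity=float("inf"))
    G.add_edge("SOURCE", "seed2", capacity=float("inf"))
    G.add_edge("z", "SINK", capacity=float("inf"))

    cut_value, (community, _rest) = nx.minimum_cut(G, "SOURCE", "SINK")
    print(sorted(community - {"SOURCE"}))    # pages on the community side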

The results of NEC studies have been reported widely in the press, including the New York Times, the Wall Street Journal, the Washington Post, Reuters, AP, UPI, CNN, the BBC, MSNBC, CBS, and NPR.

Federal Government Information Clearinghouse: GILS Initiative

The US Federal Government Information Clearinghouse is implementing a common standard for government information and services. The standard, the Global Information Locator Service, aims to make it easier for people to find information of all kinds, in all media, in all languages, and over time.

The Clearinghouse lists the Internet Archive as a partner in its efforts. Some federal Clearinghouse "portals" that have already been built include the Government Printing Office access facility, the National Spatial Data Infrastructure Clearinghouse for Geospatial Data, and the National Biological Information Infrastructure Metadata Clearinghouse. Among those likely to be built soon are portals to the Department of Energy, the Library of Congress, NASA, the National Library of Medicine, the National Oceanic and Atmospheric Administration, the Patent and Trademark Office, and the US Geological Survey.

More information on the Clearinghouse and the Global Information Locator Service:

A Partner’s Guide to the US Federal Government Information Clearinghouse

Global Information Locator Service Web site

FTP Sites in the Archive

DATES:
July to October 1996
SIZE:
0.05 terabyte (about 50,000 sites)
RATE OF GROWTH:
N/A; collection of FTP sites is currently on hold
ACCESSIBLE:
No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000
ACCESS:
See the Archive’s Terms of Use

Usenet Bulletin Boards in the Archive

DATES:
October 1996 to late 1998
SIZE:
0.592 terabyte (about 16 million postings)
RATE OF GROWTH:
N/A; collection of Usenet bulletin boards is currently on hold
ACCESSIBLE:
No access at present; we hope to move the collection from (slow) tape to (faster) disk and provide access during 2000. Try www.deja.com, which maintains a more complete collection
ACCESS:
See the Archive’s Terms of Use

Find out
How we acquire and store the collections
How to donate a digital collection to the Internet Archive
How to subscribe to Archivists, our discussion list on Internet libraries
