Old Dominion U. Researchers Ask How Much of the Web Is Archived


Researchers at Old Dominion University in Virginia are trying to figure out how much of the public Web is archived and who is storing it, as part of a larger effort to preserve the digital record.

Michael L. Nelson, a computer-science professor, has been working with professors and students since September to determine how much of the Web’s history has been preserved in Internet databases around the world.

Mr. Nelson’s team sampled 4,000 Web pages, each identified by a URI, or uniform-resource identifier, and estimated what percentage of them had been archived. A URI is a label for a specific Web-page address or name. The researchers used Memento, a browser plug-in they developed in 2009, to find old versions of the pages across various Internet archives.

The URI’s were compiled from four sources: the search-engine caches of Google, Bing, and Yahoo!; a Web directory called the Open Directory Project; a link-sharing service called Delicious; and a Web-address-shortening service called Bitly.

The report showed that 35 percent to 90 percent of Web pages have at least one archived copy and that the chance of a page being archived depended on the source. For instance, URI’s gathered from Delicious were much more likely to be archived than Bitly URI’s, but the reason for that is not entirely clear. Mr. Nelson plans to continue the project, as he felt that no “final answer” had yet been reached.
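As a rough illustration of the per-source comparison described above, the Python sketch below tallies, for each source, the share of sampled URI’s that have at least one archived copy. The function name, the source labels, and the toy sample are all hypothetical and invented for this example; they are not the researchers’ code or data.

```python
from collections import defaultdict


def archived_fraction_by_source(samples):
    """Return {source: fraction of sampled URIs with at least one archived copy}.

    `samples` is an iterable of (source, has_archived_copy) pairs.
    """
    totals = defaultdict(int)
    archived = defaultdict(int)
    for source, has_copy in samples:
        totals[source] += 1
        if has_copy:
            archived[source] += 1
    return {source: archived[source] / totals[source] for source in totals}


# Hypothetical toy sample, invented for illustration; NOT the study's data.
toy_sample = [
    ("delicious", True), ("delicious", True),
    ("delicious", True), ("delicious", False),
    ("bitly", True), ("bitly", False),
    ("bitly", False), ("bitly", False),
]

fractions = archived_fraction_by_source(toy_sample)
for source, fraction in sorted(fractions.items()):
    print(f"{source}: {fraction:.0%} of sampled URIs have an archived copy")
```

Keeping the tally separate for each source, rather than pooling every URI into one figure, is what allows a range of archival rates to be reported instead of a single number.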

Alexis Rossi, the Web-collections manager at Internet Archive, found the university’s efforts interesting, but she wondered whether it is even possible to accurately assess archival rates in a continually changing landscape.

“It’s such a moving target—the Web is expanding all the time,” Ms. Rossi said. Internet Archive was one of several archives used in the study and has been preserving the Web since 1996.

“People are coming to the realization that if nobody saves the Internet, their work will just be gone,” Ms. Rossi said. She also said the project may shed light on the efficacy of Web archiving as libraries and Internet users begin to think more about preserving the Web.

For Mr. Nelson, the study is another step toward creating a browsing experience that links the past to the present, one in which users can replay events as they unfolded, such as media coverage of Hurricane Katrina in 2005 or the Virginia Tech shootings in 2007.

“You relive the experience in a way that a summary page can’t even begin to capture,” Mr. Nelson said, imagining a day when such historical searches become common.

Scott G. Ainsworth, the project’s lead student researcher, compared saving old Web pages to the historical preservation of old Sears catalogs. “You never know what’s going to be important in 100 or 150 years,” he said.
