WebCite links provide access to archived copies of linked web pages

Matthew Cockerill 17 Sep 2007

Anyone who has tried to follow web links in a scientific article published several years ago will be familiar with the problem. You click a link, only to get a ‘Server not responding’ message, or a ‘Page not found’ error.

This lack of permanence of web links (sometimes known as link rot) is a general phenomenon across the web, but it is a particular problem for published scientific research. On the one hand, the coherence of the published scientific record depends on being able to refer back to articles, including the online material that they cite. On the other hand, the character of scientific research projects (which tend to be funded for a few years at a time) and of scientific careers (which tend to involve frequent shifts between institutions) means that scientific web pages become inaccessible with worrying regularity.

In this electronic age, it is not realistic to expect authors to refrain entirely from mentioning web pages in their articles, ephemeral as they may be. So, since late 2005, BioMed Central has been working in partnership with the WebCite initiative, based at the Centre for Global eHealth Innovation at Toronto General Hospital, to preserve archival copies of all web pages linked to from BioMed Central articles.

Wherever you see a WebCite logo, whether in the body of an article or in the reference section, you can click on it to view a version of the linked page that has been archived at WebCite.

For papers published since 2006, this archived copy will have been harvested immediately after publication, and so another benefit of this process, as well as providing some degree of digital permanence, is that it allows you to view the web page as it was at the time of publication.

For example, this Journal of Biology article links to the BioGRID database. The WebCite copy provides a snapshot of the BioGRID home page, including stats on the database, as it was at the time of publication.
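The harvesting step amounts to finding every URL in a published article and recording it together with the publication date, so that a snapshot can be requested while the page is still live. The following is a minimal sketch in Python, assuming the article is available as plain text. The function names, the regular expression, and the record format are illustrative assumptions, not WebCite's actual implementation.

```python
import re
from datetime import date

# Illustrative pattern (an assumption): an http(s) URL runs until whitespace
# or a character that usually closes a quotation or bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(article_text):
    """Return the distinct URLs in article_text, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(article_text):
        # Drop sentence punctuation that trails the URL in running text.
        url = match.group(0).rstrip(".,;:")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def harvest_requests(article_text, published):
    """Pair each cited URL with the article's publication date (hypothetical record format)."""
    return [
        {"url": url, "cited_on": published.isoformat()}
        for url in extract_links(article_text)
    ]
```

Calling `harvest_requests(text, date(2007, 9, 17))` on an article's text would give one record per distinct cited URL, which an archiving service could then use to fetch and store a snapshot.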

WebCite is not, by itself, a perfect solution. Snapshots of web pages such as those preserved by WebCite cannot fully replicate the functionality of a complex database-driven web site. Even single web pages may in some cases cause problems for the WebCite archiving robot, but this is improving all the time (please let us know if you spot any problems). Lastly, in order to provide long-term digital permanence, it is important that the WebCite project itself should have sustainable long-term support. To this end, we encourage other publishers to participate in the initiative, and to consider ways of supporting it, perhaps via a collective model similar to that used for the CrossRef linking initiative.

The caveats notwithstanding, a basic principle of digital archiving is that the sooner you start, the less you lose (as the Internet Archive has demonstrated). So we are very pleased to be working with WebCite to ensure that as much as possible of the web material linked to by BioMed Central authors is preserved for the long term.

3 Comments

mr. gunn 24th November 2007 08:48

Gunther and I have been discussing WebCite and he referred me here. I have some concerns about the sustainability and availability of a centralized solution such as WebCite. I would love to hear how BMC addressed these concerns.

matthew cockerill 26th November 2007 03:29

Dear Mr Gunn,
To address your key points:

(1) Does a DOI-style redirection model offer an effective or even superior alternative to a WebCite-style solution?

DOIs, PURLs and Handles are useful approaches to addressing the problem of digital content whose URL may change with time, though it is non-trivial to ensure that such identifiers are kept up to date and do indeed point to the appropriate current location.
But they do not address two of the other key issues addressed by WebCite:
(a) Some content simply disappears from the web, e.g. when a user’s personal directory is deleted after they leave an institution. No indirection mechanism can bring this data back once it has gone.
(b) Many web pages change over time, and WebCite’s preservation of the web page ‘as it was’ around the time of publication is particularly useful (for example, in the case of links to the NIH Public Access Policy page, which has clear potential to change over time).

(2) Does the existence of systems for archiving web pages, such as Archive.org, negate the need for WebCite?

WebCite is focused on systematically harvesting and preserving URLs referenced in scientific articles, and is able to do this by setting up relationships with publishers to take a feed of their content and scan it for URL links. By doing this it is able to ensure that it has a version of just about every cited web page as it was at the time of publication.
This is very different from Archive.org’s harvesting, which is much broader and less systematic, and so provides much less comprehensive coverage of the sort of URLs that are covered by WebCite. However, the possibility that WebCite might be able to work with Archive.org to make use of some of Archive.org’s technical infrastructure is an interesting one, which I believe is currently being explored.

(3) WebCite operates a centralized harvesting model. Does that mean that it cannot deliver long-term access to archived content?
Instead of WebCite, why don’t we rely on individual authors and/or journals to take the initiative and archive copies of any web page they feel to be important?

We have to deal with the world as it is, rather than as we would like it to be. Without WebCite, thousands of links in BioMed Central articles were becoming inaccessible each year, and it seems unlikely that burdening authors with the task of manually downloading and preserving copies of web pages would have been an effective way to address that.
WebCite, meanwhile, has successfully kept copies of many of the pages that have disappeared in the last two years. Because these pages remain available, we can discuss the best way to preserve them for the long term. But without WebCite in its current form, many of them would simply be gone for good. With digital preservation, it’s important to act quickly, because the longer you ponder the perfect long-term solution, the more data is lost. If you can preserve things for the medium term, you buy time to address these longer-term issues.

At a basic level, discussions can now take place with organizations (such as Archive.org and CrossRef) that may be able to provide robust international mirroring arrangements with WebCite, so that WebCite-archived content is not vulnerable to loss through human error, natural or man-made disaster, or simple loss of project funding. But for real long-term preservation, radical approaches may be necessary.

mark nanyingi 27th May 2008 02:18

BioMed Central is an excellent platform that gives scientists in developing countries an opportunity to share a wealth of knowledge that could otherwise remain untapped.

Comments are closed.
