
Tuesday, March 10, 2015

2015-03-10: Where in the Archive Is Michele Weigle?

(Title is an homage to a popular 1980s computer game "Where in the World Is Carmen Sandiego?")

I was recently working on a talk to present to the Southeast Women in Computing Conference about telling stories with web archives (slideshare). In addition to our Hurricane Katrina story, I wanted to include my academic story, as told through the archive.

I was a grad student at UNC from 1996-2003, and I found that my personal webpage there had been very well preserved.  It's been captured 162 times between June 1997 and October 2013 (https://web.archive.org/web/*/http://www.cs.unc.edu/~clark/), so I was able to come up with several great snapshots of my time in grad school.

https://web.archive.org/web/20070912025322/http://www.cs.unc.edu/~clark/
Aside: My UNC page was archived 20 times in 2013, but the archived pages don't have the standard Wayback Machine banner, nor are their outgoing links re-written to point to the archive. For example, see https://web.archive.org/web/20130203101303/http://www.cs.unc.edu/~clark/
Before I joined ODU, I was an Assistant Professor at Clemson University (2004-2006). The Wayback Machine shows that my Clemson home page was only crawled 2 times, both in 2011 (https://web.archive.org/web/*/www.cs.clemson.edu/~mweigle/). Unfortunately, I no longer worked at Clemson in 2011, so those both return 404s:


Sadly, there is no record of my Clemson home page. But, I can use the archive to prove that I worked there. The CS department's faculty page was captured in April 2006 and lists my name.

https://web.archive.org/web/20060427162818/http://www.cs.clemson.edu/People/faculty.shtml
Why does the 404 show up in the Wayback Machine's calendar view? Heritrix archives every response, no matter the status code. Everything that isn't 500-level (server error) is listed in the Wayback Machine. Redirects (300-level responses) and Not Founds (404s) do record the fact that the target webserver was up and running at the time of the crawl.
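For example, a quick script against the Wayback Machine's public CDX API (the endpoint and field names below are assumptions based on that API's documentation, not anything from the original post) can list every capture of the Clemson URI along with the status code Heritrix recorded:

import requests

def list_captures(uri):
    # Ask the CDX API for every capture of the URI, returning only the
    # timestamp, recorded status code, and original URI of each capture.
    resp = requests.get("http://web.archive.org/cdx/search/cdx",
                        params={"url": uri, "output": "json",
                                "fl": "timestamp,statuscode,original"},
                        timeout=30)
    rows = resp.json()
    for timestamp, status, original in rows[1:]:   # rows[0] is the field-name header
        print(timestamp, status, original)

# 404s and 302s are listed too -- Heritrix records every response it receives.
list_captures("www.cs.clemson.edu/~mweigle/")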

Wouldn't it be cool if when I request a page that 404s, like http://www.cs.clemson.edu/~mweigle/, the archive could figure out that there is a similar page (http://www.cs.unc.edu/~clark/) that links to the requested page?
https://web.archive.org/web/20060718131722/http://www.cs.unc.edu/~clark/
It'd be even cooler if the archive could then figure out that the latest memento of that UNC page now links to my ODU page (http://www.cs.odu.edu/~mweigle/) instead of the Clemson page. Then, the archive could suggest http://www.cs.odu.edu/~mweigle/ to the user.

https://web.archive.org/web/20120501221108/http://www.cs.unc.edu/~clark/
I joined ODU in August 2006.  Since then, my ODU home page has been saved 53 times (https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/).

The only memento from 2014 is on Aug 9, 2014, but it returns a 302 redirecting to an earlier memento from 2013.



It appears that Heritrix crawled http://www.cs.odu.edu/~mweigle (note the lack of a trailing /), which resulted in a 302, but http://www.cs.odu.edu/~mweigle/ was never crawled. The Wayback Machine's canonicalization is likely the reason that the redirect points to the most recent capture of http://www.cs.odu.edu/~mweigle/. (That is, the Wayback Machine knows that http://www.cs.odu.edu/~mweigle and http://www.cs.odu.edu/~mweigle/ are really the same page.)
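As a rough illustration (this is my own much-simplified sketch, not the Wayback Machine's actual SURT-based rules), canonicalization reduces both variants to the same lookup key:

from urllib.parse import urlsplit

def canonicalize(uri):
    # Toy canonicalizer: lowercase, drop a leading "www.", force a trailing slash.
    parts = urlsplit(uri.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return host + path

print(canonicalize("http://www.cs.odu.edu/~mweigle"))   # cs.odu.edu/~mweigle/
print(canonicalize("http://www.cs.odu.edu/~mweigle/"))  # cs.odu.edu/~mweigle/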

My home page is managed by wiki software and the web server does some URL re-writing. Another way to get to my home page is through http://www.cs.odu.edu/~mweigle/Main/Home/, which has been saved 3 times between 2008 and 2010. (I switched to the wiki software sometime in May 2008.) See https://web.archive.org/web/*/http://www.cs.odu.edu/~mweigle/Main/Home/

Since these two pages point to the same thing, should these two TimeMaps be merged? What happens if at some point in the future I decide to stop using this particular wiki software and end up with http://www.cs.odu.edu/~mweigle/ and http://www.cs.odu.edu/~mweigle/Main/Home/ being two totally separate pages?
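One way to look at the question is to pull both TimeMaps and compare them. The sketch below assumes the Wayback Machine's web/timemap/link/<URI> endpoint; any Memento TimeMap source would do:

import requests

def memento_count(uri):
    # Fetch the link-format TimeMap and count entries whose rel value ends in
    # "memento" (covers "memento", "first memento", and "last memento").
    tm = requests.get("https://web.archive.org/web/timemap/link/" + uri,
                      timeout=30).text
    return sum(1 for line in tm.splitlines() if 'memento"' in line)

for uri in ("http://www.cs.odu.edu/~mweigle/",
            "http://www.cs.odu.edu/~mweigle/Main/Home/"):
    print(uri, memento_count(uri))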

Finally, although my main ODU webpage itself is fairly well-archived, several of the links are not.  For example, http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe is not archived.


Also, several of the links that are archived have not been recently captured.  For instance, the page with my list of students was last archived in 2010 (https://web.archive.org/web/20100621205039/http://www.cs.odu.edu/~mweigle/Main/Students), but none of these students are still at ODU.

Now, I'm off to submit my pages to the Internet Archive's "Save Page Now" service!
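(For anyone who wants to script that submission, something like the sketch below should work; the https://web.archive.org/save/<URI> endpoint and the Content-Location response header are assumptions based on how the service currently behaves.)

import requests

pages = [
    "http://www.cs.odu.edu/~mweigle/",
    "http://www.cs.odu.edu/~mweigle/Resources/WorkingWithMe",
    "http://www.cs.odu.edu/~mweigle/Main/Students",
]

for uri in pages:
    # Ask the archive to crawl the page right now.
    r = requests.get("https://web.archive.org/save/" + uri, timeout=120)
    # If present, Content-Location names the newly created memento.
    print(uri, r.status_code, r.headers.get("Content-Location"))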

--Michele

Monday, October 27, 2014

2014-10-27: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent

Herbert and I attended the "404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent" workshop at the Georgetown Law Library on Friday, October 24, 2014. Although the origins for this workshop are many, catalysts for it probably include the recent Liebler & Liebert study about link rot in Supreme Court opinions, and the paper by Zittrain, Albert, and Lessig about Perma.cc and the problem of link rot in the scholarly and legal record, along with the resulting popular media coverage (e.g., NPR and the NYT).

The speakers were naturally drawn from the legal community at large, but some notable exceptions included David Walls from the GPO, Jefferson Bailey from the Internet Archive, and Herbert Van de Sompel from LANL. The event was streamed and recorded, and videos + slides will be available from the Georgetown site soon so I will only hit the highlights below. 

After a welcome from Michelle Wu, the director of the Georgetown Law Library, the workshop started with an excellent keynote from the always entertaining Jonathan Zittrain, called "Cites and Sites: A Call To Arms".  The theme of the talk centered around "Core Purpose of .edu", which he broke down into:
  1. Cultivation of Scholarly Skills
  2. Access to the world's information
  3. Freely disseminating what we know
  4. Contributing actively and fiercely to the development of free information platforms



For each bullet he gave numerous anecdotes and examples; some innovative, and some humorous and/or sad. For the last point he mentioned Memento, Perma.cc, and timed-release crypto.

Next up was a panel with David Walls (GPO), Karen Eltis (University of Ottawa), and Ed Walters (Fastcase). David mentioned the Federal Depository Library Program Web Archive, Karen talked about the web giving us "Permanence where we don't want it and transience where we require longevity" (I tweeted about our TPDL 2011 paper that showed that for music videos on YouTube, individual URIs die all the time but the content just shows up elsewhere), and Ed generated a buzz in the audience when he announced that in rendering their pages they ignore the links because of the problem of link rot. (Panel notes from Aaron Kirschenfeld.)

The next panel had Raizel Liebler (Yale), author of another legal link rot study mentioned above and an author of one of the useful handouts about links in the 2013-2014 Supreme Court documents, and Rod Wittenberg (Reed Tech), who talked about the findings of the Chesapeake Digital Preservation Group and gave a data dump about link rot in Lexis-Nexis and the resulting commercial impact (wait for the slides). (Panel notes from Aaron Kirschenfeld.)

After lunch, Roger Skalbeck (Georgetown) gave a webmaster's take on the problem, talking about best practices, URL rewriting, and other topics -- as well as coining the wonderful phrase "link rot deniers". During this talk I also tweeted TimBL's classic 1998 resource "Cool URIs Don't Change".

Next was Jefferson Bailey (IA) and Herbert.  Jefferson talked about web archiving, the IA, and won approval from the audience for his references to Lionel Hutz and HTTP status dogs.  Herbert's talk was entitled "Creating Pockets of Persistence", and covered a variety of topics, obviously including Memento and Hiberlink.




The point is to examine web archiving activities with an eye to the goal of making access to the past web:
  1. Persistent
  2. Precise
  3. Seamless
Even though this was a gathering of legal scholars, the point was to focus on technologies and approaches that are useful across all interested communities. He also gave examples from our "Thoughts on Referencing, Linking, Reference Rot" (aka "missing link") document, which was also included in the list of handouts. The point of this effort is to enhance existing links (with archived versions, mirror versions, etc.), but not at the expense of removing the link to the original URI and the datetime of the intended link. See our previous blog post on this paper and a similar one for Wikipedia.

The closing session was Leah Prescott (Georgetown; subbing for Carolyn Cox), Kim Dulin (Harvard), and E. Dana Neacşu (Columbia). Leah talked some more about the Chesapeake Digital Preservation Group and how their model of placing materials in a repository doesn't completely map to the Perma.cc model of web archiving (note: this actually has fascinating implications for Memento that are beyond the scope of this post). Kim gave an overview of Harvard's Perma.cc archive, and Dana gave an overview of a prior archiving project at Columbia. Note that Perma.cc recently received a Mellon Foundation grant (via Columbia) to add Memento capability.

Thanks to Leah Prescott and everyone else that organized this event.  It was an engaging, relevant, and timely workshop.  Herbert and I met several possible collaborators that we will be following up with. 





-- Michael

Thursday, December 19, 2013

2013-12-19: 404 - Your interview has been depublished

In early November 2013 I gave an invited presentation at the EcoCom conference (picture left) and at the Spreeforum, an informal gathering of researchers to facilitate knowledge exchange and foster collaborations. EcoCom was organized by Prof. Dr. Michael Herzog and his SPiRIT team, and the Spreeforum was hosted by Prof. Dr. Jürgen Sieck, who leads the INKA research group. Both events were supported by the Alcatel-Lucent Stiftung for communications research. In my talks I gave a high-level overview of the state of the art in web archiving, outlined the benefits of the Memento protocol, pointed at issues and challenges web archives face today, and gave a demonstration of the Memento for Chrome extension.

Following the talk at the Spreeforum I was asked to give an interview for the German radio station Inforadio (you may think of it as Germany's NPR). The piece aired on Monday, November 18th at 7:30am CET. As I had already left Germany I was not able to listen to it live, but I was happy to find the corresponding article online, which basically contained the transcript of the aired report along with an embedded audio file. I immediately bookmarked the page.

A couple of weeks later I revisited the article at its original URI, only to find it was no longer available (screenshot left). Now, we all know that the web is dynamic and hence links break, and we have even seen odd dynamics at other media companies before, but in this case, as I was about to find out, it was higher powers that caused the detrimental effect. Inforadio is a public radio station and therefore, like many others in Germany and throughout Europe, to a large extent financed by the public (as of 2013 the broadcast receiving license is 17.98 Euros (almost USD 25) per month per household). As such they are subject to the "Rundfunkstaatsvertrag", a contract between the German states to regulate broadcasting rights. The 12th amendment to this contract, from 2009, mandates that most online content must be removed after 7 days of publication. Huh? Yeah, I know, it sounds like a very bad joke but it is not. It even led to the coining of the term "depublish" - a paradox in itself. I had considered public radio stations to be "memory organizations", in league with libraries, museums, etc. How wrong was I, and how ironic is this, given my talk's topic!? For what it's worth though, the content does not have to be deleted from the repository; it only has to be taken offline.

I can only speculate about the reasons for this mandate, but believable opinions circulate indicating that private broadcasters and news publishers complained about unfair competition. In this sense, the claim was made that "eternal" availability of broadcast content on the web is unfair competition, as the private sector is not given the appropriate funds to match that competitive advantage. Another point that supposedly was made is that this online service goes beyond the mandate of public radio stations and hence would constitute a misguided use of public money. To me personally, none of this makes any sense. Broadcasters of all sorts have realized that content (text, audio, and video) is increasingly consumed online and hence are adjusting their offerings. How this can be seen as unfair competition is unclear to me.

But back to my interview. Clearly, one can argue (or not) whether the document is worth preserving, but my point here is a different one:
Not only did I bookmark the page when I saw it, I also immediately tried to push it into as many web archives as I could. I tried the Internet Archive's new "save page now" service but, to add insult to injury, Inforadio also has a robots.txt file in place that prohibits the IA from crawling the page. To the best of my knowledge this is not part of the 12th amendment to the "Rundfunkstaatsvertrag" so the broadcaster could actually take action to preserve their online content. Other web sites of public radio and TV stations such as Deutschlandfunk or ZDF do not prohibit archives from crawling their pages.
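To check such a block programmatically, something like the sketch below works; "ia_archiver" is assumed here as the user-agent token the Internet Archive honors:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.inforadio.de/robots.txt")
rp.read()

uri = ("http://www.inforadio.de/programm/schema/sendungen/"
       "netzfischeer/201311/vergisst_das_internet.html")
# False means the Internet Archive's crawler is locked out of the article.
print(rp.can_fetch("ia_archiver", uri))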



Fortunately, the archiving service Archive.is was able to grab the page (screenshot left) but the audio feed is lost.



Just one more thing (Peter Falk style):
Note that the original URI of the page:

http://www.inforadio.de/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

when requested in a web browser redirects (302-style) to:

http://www.inforadio.de/error/404.html?/rbb/inf/programm/schema/sendungen/netzfischeer/201311/vergisst_das_internet.html

The good news here: it is not a soft 404, so the error is somewhat robot friendly. The bad news is that the original URI is thrown away. As the original URI is the only key for a search in web archives, we cannot retrieve any archived copies (such as the one I created in Archive.is) without it. Unfortunately, this is not only true for manual searches but it also undermines automatic retrieval of archived copies by clients such as the browser extension Memento for Chrome. As stressed in our recent talk at CNI, this is very bad practice and unnecessarily makes life harder for those interested in obtaining archived copies of web pages at large, not only my radio interview.

--
Martin

Friday, December 13, 2013

2013-12-13: Hiberlink Presentation at CNI Fall 2013

Herbert and Martin attended the recent Fall 2013 CNI meeting in Washington DC, where they gave an update about the Hiberlink Project (joint with the University of Edinburgh), which is about preserving the referential integrity of the scholarly record. In other words, we link to the general web in our technical publications (and not just other scholarly material) and of course the links rot over time.  But the scholarly publication environment does give us several hooks to help us access web archives to uncover the correct material. 

As always, there are many slides, but they are worth the time to study. Of particular importance are slides 8-18, which help differentiate Hiberlink from other projects, and slides 66-99, which walk through a demonstration of how the "Missing Link" concepts (along with the Memento for Chrome extension) can be used to address the problem of link rot. In particular, absent specific versiondate attributes on a link, such as:

<a versiondate="some-date-value" href="...">

A temporal context can be inferred from the "datePublished" META value defined by schema.org:

<META itemprop="datePublished" content="some-ISO-8601-date-value">



Again, the slides are well worth your time.

--Michael


Thursday, November 21, 2013

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

Circulating the web last week was the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below). Not only did they remove the speeches from their website, but via their robots.txt file they also blocked the Internet Archive from serving their archived versions of the pages as well (see Note 2 below for a discussion of robots.txt, as well as an update about availability in the Internet Archive). But even though the Internet Archive allows site owners to redact pages from their archive, mementos of the pages likely exist in other archives. Yes, the Internet Archive was the first web archive and is still by far the largest with 240B+ pages, but the many other web archives, in aggregate, also provide good coverage (see our 2013 TPDL paper for details).

Consider this randomly chosen 2009 speech:

http://www.conservatives.com/News/Speeches/2009/11/David_Cameron_The_Big_Society.aspx

Right now it produces a custom 404 page (see Note 3 below):


Fortunately, the UK Web Archive, Archive-It (collected by the University of Manchester), and Archive.is all have copies (presented in that order):




So it seems clear that this speech will not disappear down a memory hole.  But how do you discover these copies in these archives?  Fortunately, the UK Web Archive, Archive-It, and Archive.is (as well as the Internet Archive) all implement Memento, an inter-archive discovery framework.  If you use a Memento-enabled client such as the recently released Chrome extension from LANL, the discovery is easy and automatic as you right-click to access the past.

If you're interested in the details, the Memento TimeMap lists the four available copies (Archive-It actually has two copies):
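Such an aggregate TimeMap can also be fetched programmatically; the sketch below assumes the current Time Travel aggregator endpoint (timetravel.mementoweb.org/timemap/link/<URI>) rather than the LANL aggregator URI that was in use in 2013:

import requests

uri = ("http://www.conservatives.com/News/Speeches/2009/11/"
       "David_Cameron_The_Big_Society.aspx")
tm = requests.get("http://timetravel.mementoweb.org/timemap/link/" + uri,
                  timeout=60).text

# Print one line per memento, whichever archive happens to hold it.
for line in tm.splitlines():
    if 'memento"' in line:
        print(line.strip())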



The nice thing about the multi-archive access of Memento is that as new archives are added (or in this case, if the administrators at conservatives.com decide to unredact the copies in the Internet Archive), the holdings (i.e., TimeMaps) are seamlessly updated -- the end-user doesn't have to keep track of the dozens of public web archives and manually search them one-at-a-time for a particular URI.

We're not sure how many of the now missing speeches are available in these and other archives, but this does nicely demonstrate the value of having multiple archives, in this case all with different collection policies:
  • Internet Archive: crawl everything
  • Archive-It: collections defined by subscribers
  • UK Web Archive: archive all UK websites (conservatives.com is a UK web site even though it is not in the .uk domain)
  • Archive.is: archives individual pages on user request
Download and install the Chrome extension and all of these archives and more will be easily available to you.

-- Michael and Herbert

Note 1: According to this BBC report, the UK Labour party also deletes material from their site, but apparently they don't try to redact from the Internet Archive via robots.txt.  For those who are keeping score, David Rosenthal regularly blogs about the threat of governments altering the record (for example, see: June 2007, October 2010, July 2012, August 2013).  "We've always been at war with Eastasia."

Note 2: In the process of writing this blog post, the Internet Archive stopped blocking access to this speech (and presumably the others). Here is the raw HTTP of the speech being blocked (the key is the "X-Archive-Wayback-Runtime-Error:" line):



But access was restored sometime in the space of three hours before I could generate a screen shot:



Why was it restored?  Because the conservatives.com administrators changed their robots.txt file on November 13, 2013 (perhaps because of the backlash from the story breaking?).  The 08:36:36 version of robots.txt has:

...
Disallow: /News/News_stories/2008/
Disallow: /News/News_stories/2009/
Disallow: /News/News_stories/2010/01/
... 

But the 18:10:19 version has:
...  
Disallow: /News/Blogs.aspx
Disallow: /News/Blogs/
...  

These "Disallow" rules no longer match the URI of the original speech.  I guess the Internet Archive cached the disallow rule and it just now expired one week later.  See the IA's exclusion policy for more information about their redaction policy and robotstxt.org for details about syntax.

The TimeMap from the LANL aggregator is now current with 28 mementos from the Internet Archive and 4 mementos from the other three archives. We're keeping the earlier TimeMap above to illustrate how the Memento aggregator operates; the expanded TimeMap (with the Internet Archive mementos) is below:



Note 3: Perhaps this is a Microsoft-IIS thing, but their custom 404 page, while pretty, is unfortunate.  Instead of returning a 404 page at the original URI (like Apache), it 302 redirects to another URI that returns the 404:



See our 2013 TempWeb paper for a discussion about redirecting URI-Rs and which values to use as keys when querying the archives.

--Michael

Friday, November 8, 2013

2013-11-08: Proposals for Tighter Integration of the Past and Current Web

The Memento Team is soliciting feedback on two white papers that address related proposals for more tightly integrating the past and current web.

The first is "Thoughts on Referencing, Linking, Reference Rot", which is inspired by the hiberlink project.  This paper proposes making temporal semantics part of the HTML <a> element, via "versiondate" and "versionurl" attributes that respectively include the datetime the link was created and optionally a link to an archived version of the page (in case the live web version becomes 404, goes off topic, etc.).  The idea is that "versiondate" can be used as a Memento-Datetime value by a client, and "versionurl" can be used to record a URI-M value.  This approach is inspired by the Wikipedia Citation Template, which has many metadata fields, including "accessdate" and "archiveurl".  For example, in the article about the band "Coil", one of the links to the source material is broken, but the Citation Template has values for both "accessdate" and "archiveurl":



Unfortunately, when this is transformed into HTML the semantics are lost or relegated to microformats:



A (simple) version with machine-actionable links suitable for the Memento Chrome extension or Zotero could have looked like this in the past, ready to activate when the link eventually went 404:
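In code, a client could act on those attributes roughly as in the sketch below (the URI and attribute values are placeholders of mine, not the actual Coil citation):

import requests

link = {   # hypothetical attribute values, in the spirit of the proposal
    "href": "http://www.example.com/coil-interview.html",
    "versiondate": "2007-02-02",
    "versionurl": "https://web.archive.org/web/20070202000000/"
                  "http://www.example.com/coil-interview.html",
}

resp = requests.get(link["href"], allow_redirects=True, timeout=30)
# If the live link has rotted, activate the archived version instead.
target = link["href"] if resp.status_code == 200 else link["versionurl"]
print("follow:", target)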



The second paper, "Memento Capabilities for Wikipedia", "describes added value that Memento can bring to Wikipedia and other MediaWiki platforms.  One is enriching their external links with the recommendations from our first paper (described above), and the second is about native Memento support for wikis.

Native Memento support is possible via a new Memento Extension for MediaWiki servers that we announced for testing and feedback on the wikitech-l list. This new extension is the result of a significant re-engineering effort guided by feedback received from Wikipedia experts to a previous version.  When installed, this extension allows clients to access the "history" portion of wikis in the same manner as they access web archives.  For example, if you wanted to visit the Coil article as it existed on February 2, 2007 instead of wading through the many pages of the article's history, your client would use the Memento protocol to access a prior version with the "Accept-Datetime" request header:



and the server would eventually redirect you to:
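In HTTP terms, the exchange would look roughly like the sketch below; the wiki URL is a placeholder and assumes an installation with the extension enabled:

import requests

resp = requests.get(
    "http://wiki.example.org/index.php/Coil",   # placeholder wiki URL
    headers={"Accept-Datetime": "Fri, 02 Feb 2007 00:00:00 GMT"},
    allow_redirects=False, timeout=30)

# A Memento-enabled wiki answers with a redirect toward the old revision and
# identifies its datetime in the Memento-Datetime header.
print(resp.status_code, resp.headers.get("Location"),
      resp.headers.get("Memento-Datetime"))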



In a future blog post we will describe how a Memento-enabled wiki can be used to avoid spoilers on fan wikis (e.g., the A Song of Ice and Fire wiki) by setting the Accept-Datetime to be right before an episode or book is released.

We've only provided a summary of the content of the two papers and we invite everyone to review them and provide us with feedback (here, twitter, email, etc.). 

--Michael & Herbert

Saturday, November 10, 2012

2012-11-10: Site Transitions, Cool URIs, URI Slugs, Topsy

Recently I was emailing a friend and wanted to update her about the recent buzz we have enjoyed with Hany SalahEldeen's TPDL 2012 paper about the loss rate of resources shared over Twitter.  I remembered that an article in the MIT Technology Review from the Physics arXiv blog started the whole wave of popular press (e.g., MIT Technology Review, BBC, The Atlantic, Spiegel).  To help convey the amount of social media sharing of these stories, I was sending links to the sites using the social media search engine Topsy.  I discovered Topsy only recently, but it has quickly become one of my favorite sites.  It does many things, but the part I enjoy most is the ability to prepend "http://topsy.com/" to a URI to discover how many times that URI has been shared and who is sharing it.  For example:

http://www.bbc.com/future/story/20120927-the-decaying-web

becomes:

http://topsy.com/http://www.bbc.com/future/story/20120927-the-decaying-web

and you can see all the tweets that have linked to the bbc.com URI. 

While composing my email I recalled that the Technology Review article was one of the first (September 19, 2012) and most popular, so I did a Google search for the article and converted the resulting URI from:

http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

to:

http://topsy.com/http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

I was surprised when I saw Topsy reported 0 posts about the MIT TR story, because I recalled it being quite large.  I thought maybe it was a transient error and didn't think too much about it until later that night when I was on my home computer where I had bookmarked the MIT TR Topsy URI and it said "900 posts".  Then I looked carefully: the URI I had bookmarked now issues a 301 redirection to another URI:

% curl -I http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352561072"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 15:24:32 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 15:24:32 GMT
X-Varnish: 1779081554
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


A little poking around revealed that technologyreview.com reorganized and rebranded their site on October 24, 2012, and Google had already swapped the prior URI to the article with the new URI.  Their site uses Drupal and it appears their old site did as well, but the URIs have changed.  The base URIs (e.g., http://www.technologyreview.com/view/429274/) have stayed the same (and are thus almost "cool"), but the slug has lengthened from 8 terms ("history as recorded on twitter is vanishing from") to the full title ("history as recorded on twitter is vanishing from the web say computer scientists").  Slugs are a nice way to make the URI more human readable, and can be useful in determining what the URI was "about" if (or when) it becomes 404 (see also Martin Klein's dissertation on lexical signatures).  The base URI will 301 redirect to the URI with the slug:

% curl -I http://www.technologyreview.com/view/429274/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352563816"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 16:10:16 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 16:10:16 GMT
X-Varnish: 1779473907
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


But this redirection is transparent to the user, so all the tweets that Topsy analyzes are the versions with slugs.  This results in two URIs for the article: the version from Sept 19 -- Oct 24 that has 900 tweets, and the Oct 24 -- now version that currently has 3 tweets (up from 0 when I first noticed this).  technologyreview.com is to be commended for not breaking the pre-update URIs (see the post about how ctv.ca handled a similar situation) and issuing 301 redirections to the new versions, but it would have been preferable to have maintained the old URIs completely (perhaps the new software installation has a different default slug length; I'm not familiar with Drupal, and in the code examples I can find, a limit is not defined).

Splitting PageRank with URI aliases is a well-known problem that can be addressed with 301 redirects (e.g., this is why most URI shorteners like bitly issue 301 redirects (instead of 302s), so the PageRank will accumulate at the target and not the short URI).  It would be nice if Topsy also merged redirects when computing their pages.  In the example above, that would result in either of the Topsy URIs (pre- and post-October 24) reporting 900+3 = 903 posts (or at least provided that as an option).  
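A sketch of that merging (using the two URIs and counts mentioned above; Topsy's own data and API are not involved) is simply to resolve each shared URI through its redirect chain and accumulate the counts under the final target:

from collections import Counter
import requests

shared = {   # post counts as reported above
    "http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from/": 900,
    "http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/": 3,
}

merged = Counter()
for uri, count in shared.items():
    # HEAD request that follows the 301 chain; .url is the final target.
    final = requests.head(uri, allow_redirects=True, timeout=30).url
    merged[final] += count

print(merged)   # both counts accumulate under the post-redirect URI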

--Michael

Edit: I did some more investigating and found that the slug doesn't matter, only the Drupal node ID of "429274" (those familiar with Drupal probably already knew that).  Here's a URI that should obviously return a 404 redirecting to the URI with the full title as the slug:

% curl -I http://www.technologyreview.com/view/429274/lasdkfjlajfdsljkaldsf/
HTTP/1.1 301 Moved Permanently
Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
ETag: "1352581871"
Content-Language: en
Last-Modified: Sat, 10 Nov 2012 21:11:11 GMT
Location: http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/
X-AH-Environment: prod
Vary: Accept-Encoding
Content-Length: 0
Date: Sat, 10 Nov 2012 21:11:11 GMT
X-Varnish: 1782237238
Age: 0
Via: 1.1 varnish
Connection: keep-alive
X-Cache: MISS


This makes the Drupal slug very close to the original Phelps & Wilensky concept of "Robust Hyperlinks Cost Just Five Words Each", which formed the basis for Martin's dissertation mentioned above.  While this is convenient in that it reduces the number of 404s in the world, it is also a bit of a white lie; user agents need to be careful to not assume that the original URI ever existed even though it is issuing a redirect to a target URI. 

Thursday, July 28, 2011

2011-07-28: Web Video Discussing Preservation Disappears After 24 Hours

One week ago (July 21, 2011) I was fortunate enough to be invited to speak about Web Archiving on Canada AM, sort of like the Today Show or Good Morning America in the US. I was asked to appear on the program in part because of the July 17, 2011 article in the Washington Post, which followed a July 6, 2011 blog post for the Chronicle of Higher Education, which was based on a June 23, 2011 blog post about our JCDL 2011 paper "How Much of the Web is Archived?". In other words, the process went like this: step 1 - get lucky & step 2 - let preferential attachment do its thing.

I was able to do the appearance in Washington DC, while attending the NDSA/NDIIPP 2011 Partner Meetup. The morning of July 21, I took a taxi to an ABC studio in DC, did the interview (about 4 minutes) and took a taxi back to the conference in time to make the morning session. I had not been on TV before and was both nervous and excited. The local and Canadian crew made the entire experience painless and the whole interview was over right as I started to get comfortable.

Given the short time, I tried to stress two topics: the first is that the ODU/LANL Memento project is not a new archive, but rather a way to leverage all existing web archives at once (this is a common misunderstanding we've experienced in the past). The other point I tried to make was that much of our cultural discourse occurs on the web and we should try to preserve as much of that as possible (including things like lolcats) because we (collectively) do a bad job at predicting what will be important in the future. Shortly after airing, the video segment was available online at:

http://www.ctv.ca/canadaam/?video=504307

As the URI suggests, this is the homepage for Canada AM (http://www.ctv.ca/canadaam/), but with an argument ("?video=504307") specifying which video segment (i.e., each individual story -- not the entire morning's show) to display. I shared the video URI with colleagues, friends, and family and was enjoying my 4 minutes of fame (I should still have 11 left in the bank). I had not made a local copy of the video because their web site obfuscated the actual URI of the streaming video, I had to finish the rest of the conference and drive back to Norfolk, and I thought I would have the time to figure it out after I returned.

So imagine my surprise on Friday at about lunch time when I reload the URI and do not see the video, but instead a newly redesigned Canada AM web page! The video of me making the point that we should save web resources lasted approximately 24 hours. I don't mean to seem ungrateful for the opportunity Canada AM afforded me, but as a professor I try to see everything as a teaching opportunity, so here it goes...

Sometime on Friday morning (July 22), the entire web site was redesigned and the old URIs no longer worked (cf. "Cool URIs Don't Change"). The video id was an argument and is now silently ignored, so even worse than a 404 you now get a "soft 404":

% curl -I http://www.ctv.ca/canadaam/\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550613724
Date: Thu, 28 Jul 2011 16:55:48 GMT
Connection: keep-alive

The soft 404 means people clicking on the original video link in Facebook, Twitter, email, etc. won't even see an error page -- they see the new site, but without the video or indication that the video is missing. The new site has a link titled "watch full shows", with the URI:

http://www.ctv.ca/canadaAMPlayer/index.html

which is textually described as the "Canada AM Video Archive", but the archive begins on July 22, 2011 -- one day after my appearance! The new segments are available at URIs of the form:

http://www.ctv.ca/canadaAMPlayer/index.html?video=504933

The older videos are not available, not even as an argument to the new URI, which also returns a soft 404 (i.e., the video is not available despite the 200 response):

% curl -I http://www.ctv.ca/canadaAMPlayer/index.html\?video=504307
HTTP/1.1 200 OK
Server: Apache/2.2.14 (Ubuntu)
Content-Type: text/html
X-Varnish: 2550976182
Date: Thu, 28 Jul 2011 17:35:35 GMT
Connection: keep-alive
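One rough way to detect a soft 404 automatically (a sketch of mine, not something ctv.ca or any archive does) is to request an obviously bogus sibling URI and compare it with the page in question; if both come back 200 with near-identical content, the "real" URI is a soft 404:

import uuid
import requests

def looks_like_soft_404(uri, junk_uri):
    real = requests.get(uri, timeout=30)
    bogus = requests.get(junk_uri, timeout=30)
    # If a URI that cannot possibly exist also returns 200 with roughly the
    # same amount of content, the server is masking its errors.
    return (real.status_code == 200 and bogus.status_code == 200 and
            abs(len(real.text) - len(bogus.text)) < 500)

print(looks_like_soft_404(
    "http://www.ctv.ca/canadaam/?video=504307",
    "http://www.ctv.ca/canadaam/?video=" + uuid.uuid4().hex))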


The video ids seem to be continuous (i.e., they did not appear to start over with "1"), so URL rewriting could easily make all the old video URIs continue to work, unless whatever CMS hosted those videos has been retired with no migration path forward.

Here are some screen shots of the newly redesigned home page (left) and the video archive page (right) from July 22:










Of course, I did not think to make a screen shot of the original home page, or the page of my video because I thought it would live longer than 24 hours! I was able to find a recent (December 8, 2010) copy in the Internet Archive's Wayback Machine:

http://web.archive.org/web/20101208084455/http://www.ctv.ca/canadaam/

And I also pushed the two pages above to WebCite, which nicely contrasts two styles of giving URIs for archived pages (URI-M in Memento parlance):


http://www.webcitation.org/60NizRC0o
http://www.webcitation.org/60Nj60H8D

The IA's URIs violate the W3C "good practice" of URI opacity, but they sure are handy for humans. WebCite actually offers both styles of URIs, for example the latter of the two URIs above is equivalent to:

http://www.webcitation.org/query?url=http%3A%2F%2Fwww.ctv.ca%2FcanadaAMPlayer%2Findex.html&date=2011-07-22

But the resulting URI encoding, while technically correct, is not conducive to easy memorizing and exploration by humans. Different styles of using a URI as an argument to another URI will be explored in a future blog post.

Fortunately I was given a DVD of the session, from which I was able to rip a copy and upload it to YouTube, provided below with the dual interests of vanity and pedagogy. I'm not sure about its status with respect to copyright, so it might disappear in the future as well. It should be covered under fair use, but I would not count on it. However, that is also a topic for another blog post...



--Michael

2012-05-30 Update: Apparently Canada AM did create a new page about the video, including a nice, anonymously authored summary of the material with direct quotes from me:


It appears to have been authored on July 24, 2011, according not just to the byline but to the HTTP response headers as well.  For example, look at the "Last-Modified" header for this image that appears in the page:

% curl -I http://images.ctv.ca/archives/CTVNews/img2/20110721/470_professor_nelson_110721_225128.jpg
HTTP/1.1 200 OK
Server: Apache/2.2.0 (Unix) DAV/2
Last-Modified: Sun, 24 Jul 2011 10:52:59 GMT
ETag: "a9e08e-51a4-807938c0"
Accept-Ranges: bytes
Content-Length: 20900
Content-Type: image/jpeg
Date: Wed, 30 May 2012 14:02:21 GMT
Connection: keep-alive

I originally wrote the above article on July 28, 2011 and I was unable to find any trace of my appearance on their site.  Perhaps I just missed it, or perhaps it was written but not yet linked.  This nicely illustrates the premise behind Martin Klein's PhD research: things rarely disappear completely, they just move to a new location; the trick is finding them.

Friday, June 17, 2011

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl conducted by the Library of Congress between 2004 and 2006 for web pages related to the topics of federal elections and terror.

We created the corpus to test the performance of our methods to rediscover missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure" published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome the detriment of 404s to everyone's browsing experience in real time.

To the best of our knowledge the Book of the Dead is the first corpus of this kind. It is publicly available and we are hopeful that fellow researchers can benefit from it by conducting related work. The corpus can be downloaded at: http://bit.ly/Book-of-the-Dead

And one more thing... not only does the corpus include the missing URIs, it also contains a best guess of what each of the URIs used to be about. We used Amazon's Mechanical Turk and asked workers to guess what the content of the missing pages used to be. We only provided the URIs and the general topics elections and terror. The workers were supposed to just analyze the URI and draw their conclusions. Sometimes this can be an easy task; for example, the URI:

http://www.de.lp.org/election2004/morris.html

is clearly about an election event in 2004. Maybe one could know that "lp" stands for Libertarian Party and "de" for Delaware. Now this URI makes real sense and most likely "Morris" was a candidate running for office during the elections.

Altogether, the Book of the Dead now offers missing URIs and their estimated "aboutness", which makes it a valuable dataset for retrieval and archival research.
--
martin

Friday, June 10, 2011

2011-06-10: Launching Synchronicity - A Firefox Add-on for Rediscovering Missing Web Pages in Real Time


Today we introduce Synchronicity, a Firefox extension that supports the user in rediscovering missing web pages. It triggers on the occurrence of 404 "Page not Found" errors, provides archived copies of the missing page as well as five methods to query search engines for the new location of the page (in case it has moved) or to obtain a good enough replacement page (in case the page is really gone).
Synchronicity works in real time and helps to overcome the detriment of link rot in the web.

Installation:
Download the add-on from https://addons.mozilla.org/en-US/firefox/addon/synchronicity and follow the installation instructions. After restarting Firefox you will notice Synchronicity's shrimp icon in the right corner of the status bar.

Usage:
Whenever a 404 "Page not Found" error occurs the little icon will change colors and turn to notify the user that it has caught the error. Just click once on the red icon and the Synchronicity panel will load up.
Synchronicity utilizes the Memento framework to obtain archived copies of a page. On startup you are in the Archived Version tab where two visualizations of all available archived copies are offered.
The TimeGraph is a static image giving an overview of the number of copies available per year. Three drop-down boxes enable you to pick a particular copy by date and have it displayed in the main browser window.
The TimeLine offers a "zoomable" way to explore the copies according to the time they were archived. Each copy is represented by the icon of its hosting archive. You can click on the icon to receive metadata about the copy and see a link that will display the copy. You can also filter the copies by their archive.


Based on these copies, Synchronicity provides two content-based methods:
  1. the title of the page
  2. the keywords (lexical signature) of the page
both of which can be used as queries against Google, Yahoo!, and Bing. The idea is that these queries represent the "aboutness" of the missing page and hence make a good query to discover the page at its new location (URI) or to discover a good enough replacement page that satisfies the user's information need.
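A rough sketch of the lexical signature idea (real lexical signatures are typically TF-IDF based with more careful term selection; this only conveys the idea) is just the top few content terms of an archived copy:

import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "to", "a", "in", "for", "is", "on", "that"}

def lexical_signature(text, k=5):
    # Keep alphabetic terms of 3+ characters, drop stopwords, take the top k.
    terms = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS)
    return " ".join(term for term, _ in counts.most_common(k))

archived_copy = "..."   # plain text extracted from a memento of the missing page
print(lexical_signature(archived_copy))   # a short query for Google, Yahoo!, or Bing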


Synchronicity can further obtain tags from Delicious that users created to annotate the page. Even though tags are sparse, if available they can make a well-performing search engine query. Additionally, Synchronicity will extract the most salient keywords from pages that link to the missing page (the link neighborhood lexical signature), which again can be used as a query.
Lastly, Synchronicity offers a convenient way to modify the URL that caused the 404 error and try again. The idea is that maybe shortening the path will get you where you want to go.

These last three methods can be applied if no archived copy of the missing page can be found.
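To illustrate the URL-shortening idea, it amounts to generating the ancestors of the 404ing path and trying each in turn (the URI below is just a placeholder):

from urllib.parse import urlsplit, urlunsplit

def shortened_candidates(uri):
    # Successively chop the last path segment off and yield each ancestor URI.
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

for candidate in shortened_candidates("http://www.example.com/news/2011/story.html"):
    print(candidate)
# http://www.example.com/news/2011/
# http://www.example.com/news/
# http://www.example.com/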

Synchronicity provides a straightforward interface but also enables more experienced users to modify all parameters underlying the extraction of titles, keywords, tags, and extended keywords. The Expert Interface lets you, for example, show the titles of the last n copies, where you specify the value of n. It also enables you to pick a particular copy to extract the keywords from and change many more parameters.



Notes:
Synchronicity is a beta release so do not let it perform open-heart surgery on your mother-in-law!
It was developed within the WS-DL research group in the Computer Science Department at Old Dominion University by Moustafa Aly and Martin Klein under the supervision of Dr. Michael L. Nelson.

Please send your feedback, comments and suggestions for improvement to
synchronicity-info@googlegroups.com

--
martin