Monday, November 21, 2016

2016-11-21: WS-DL Celebration of #IA20



The Web Science & Digital Library Research Group celebrated the 20th Anniversary of the Internet Archive with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). 

Normally our group posts about research developments and technical analysis of web archiving, but for #IA20 we had members of our group write mostly non-technical stories drawn from personal experiences and interests that are made possible by web archiving.  We are often asked "Why archive the web?" and we hope these blog posts will help provide you with some answers.
We've collected these links and more material related to #IA20 in both a Storify story and a Twitter moment; we hope you can take the time to explore them further.  We'd like to thank everyone at the Internet Archive for 20 years of yeoman's work, the many other archives that have come on-line more recently, and all of the WS-DL members who made the time to provide their personal stories about the impacts and opportunities of web archiving.

--Michael

Wednesday, November 16, 2016

2016-11-16: Reminiscing About The Days of Cyber War Between Indonesia and Australia


Image taken from Wikipedia

Indonesia and Australia are neighboring countries that, as often happens between neighbors, have a hot-and-cold relationship. History has recorded a number of disputes between Indonesia and Australia, from the secession of East Timor (now Timor-Leste) in 1999 to the Bali Nine case (the execution of Australian drug smugglers) in 2015. One of the issues that really caused a stir in Indonesia-Australia relations is the spying imbroglio in which Australia spied on Indonesia. The tension arose when the Australian newspaper The Sydney Morning Herald published an article titled Exposed: Australia's Asia spy network and a video titled Spying at Australian diplomatic facilities on October 31, 2013. Drawing on one of Edward Snowden's leaks, they revealed that Australia had been spying on Indonesia since 1999. This startling fact enraged Indonesia's government and, most definitely, the people of Indonesia.

Indonesia strongly demanded clarification and an explanation by summoning Australia's ambassador, Greg Moriarty. Indonesia also demanded that Australia apologize. Australia refused, arguing that this is something every government does to protect its country. The situation became more serious when it was also divulged that an Australian security agency had attempted to listen in on Indonesian President Susilo Bambang Yudhoyono's cell phone in 2009. Yet Tony Abbott, Australia's prime minister at the time, still refused to give either an explanation or an apology. This caused President Yudhoyono to accuse Tony Abbott of 'belittling' Indonesia's response to the issue. All of this made the already enraged Indonesian public even more furious. Furthermore, many Indonesians judged that their government was too slow in following up on and responding to the issue.

Image taken from The Australian

To channel their frustration and anger, a group of Indonesian hacktivists named 'Anonymous Indonesia' launched attacks on hundreds of Australian websites that were chosen at random. They hacked and defaced those websites to spread the message 'stop spying on Indonesia'. Over 170 Australian websites were hacked during November 2013, some of them government websites such as those of the Australian Secret Intelligence Service (ASIS), the Australian Security Intelligence Organisation (ASIO), and the Department of Foreign Affairs and Trade (DFAT).

Australian hackers took revenge by attacking several important Indonesian websites, such as those of the Ministry of Law and Human Rights and Indonesia's national airline, Garuda Indonesia. But the number of websites they attacked was far smaller than the number attacked by the Indonesians. These websites have since recovered and look as if the attacks never happened. Fortunately, those who never heard of this spying row before can use the Internet Archive to go back in time and see how those websites looked when they were attacked. Unfortunately, not all of the attacked websites have archives for November 2013. For example, according to The Sydney Morning Herald and the Australian Broadcasting Corporation, the ASIS website was hacked on November 11, 2013. The Australian newspaper also reported that the ASIO website was hacked on November 13, 2013. But these incidents were not archived by the Internet Archive, as we cannot see any snapshot for the given dates.

https://web.archive.org/web/20130101000000*/http://asis.gov.au

https://web.archive.org/web/20130101000000*/http://asio.gov.au
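Besides browsing the calendar pages above, one can also query the Wayback Machine's availability API for the capture closest to a given date; an empty result means nothing was archived near that date. A minimal Python sketch:

# Query the Wayback Machine availability API for the capture closest
# to a given date (YYYYMMDD); returns None if nothing suitable exists.
import json
from urllib.request import urlopen

def closest_snapshot(url, timestamp):
    api = "http://archive.org/wayback/available?url=%s&timestamp=%s" % (url, timestamp)
    with urlopen(api) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

print(closest_snapshot("asis.gov.au", "20131111"))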


However, we are lucky enough to have sufficient examples to give us a clear idea of the cyber war that once took place between Indonesia and Australia.

http://web.archive.org/web/20130520072344/http://australianprovidores.com.au

http://web.archive.org/web/20131106225110/http://www.danzaco.com.au/

http://web.archive.org/web/20131112141017/http://defence.gov.au/

http://web.archive.org/web/20131107064017/http://dmresearch.com.au

http://web.archive.org/web/20131109094537/http://www.flufferzcarwashcafe.com.au/

http://web.archive.org/web/20131105222138/http://smartwiredhomes.com.au

                   - Erika (@erikaris)-

2016-11-16: Introducing the Local Memory Project

Collage made from screenshots of local news websites across the US
The national news media has different priorities than the local news media. If one seeks to build a collection about local events, the national news media may be insufficient, with the exception of local news that "bubbles" up to the national news media. Irrespective of this "bubbling" of some local news to the national surface, the perspective and reporting of national news differs from that of local news for the same events. Also, it is well known that big multinational news organizations routinely cite the reports of smaller local news organizations for many stories. Consequently, local news media is fundamental to journalism.



It is important to consult local sources affected by local events, thus the need for a system that helps small communities build collections of web resources from local sources for important local events. The need for such a system was first (to the best of my knowledge) outlined by Harvard LIL. Given Harvard LIL's interest in helping facilitate participatory archiving by local communities and libraries, and our IMLS-funded interest in building collections for stories and events, my summer fellowship at Harvard LIL provided a good opportunity to collaborate on the Local Memory Project.

Our goal is to provide a suite of tools under the umbrella of the Local Memory Project to help users and small communities discover, collect, build, archive, and share collections of stories for important local events from local sources.

Local Memory Project dataset

We currently have a public JSON dataset of US media outlets scraped from USNPL:
  • 5,992 Newspapers 
  • 1,061 TV stations, and 
  • 2,539 Radio stations
The dataset structure is documented and comprises each outlet's website; Twitter, Facebook, and YouTube links; RSS/OpenSearch links; and the geo-coordinates of the cities or counties in which the local media organizations reside. I strongly believe this dataset could be essential to the media research community.

There are currently 3 services offered by the Local Memory Project:

1. Local Memory Project - Google Chrome extension:

This service is an implementation of Adam Ziegler and Anastasia Aizman's idea for a utility that helps one build a collection for a local event which did not receive national coverage. Consequently, given a story (expressed as a query input) and a place (represented by a zip code input), the Google Chrome extension performs the following operations:
  1. Retrieve a list of local news websites (newspapers and TV stations) that serve the zip code
  2. Search Google for stories from each of the local news websites retrieved in 1
The result is a collection of stories about the query from local news sources, as sketched below.
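Conceptually, step 2 amounts to issuing one site-scoped Google query per outlet. Here is a minimal Python sketch of that idea (an illustration, not the extension's actual code; the outlet record shape follows the dataset described above, and the example website is hypothetical):

# Build one site-scoped Google search URL per local news outlet.
from urllib.parse import quote_plus, urlparse

def site_queries(query, outlets):
    urls = []
    for outlet in outlets:
        host = urlparse(outlet["website"]).netloc
        q = quote_plus("%s site:%s" % (query, host))
        urls.append("https://www.google.com/search?q=" + q)
    return urls

# Hypothetical outlet record in the shape of the dataset above
outlets = [{"website": "http://miamitimesonline.com/"}]
for url in site_queries("zika virus", outlets):
    print(url)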

For example, to build a collection about the Zika virus for Miami, Florida, we issue the following inputs (Figure 1) to the Google Chrome Extension and click "Submit":
Figure 1: Google Chrome Extension, input for building a collection about Zika virus for Miami FL
After the submit button is pressed, the application issues the "zika virus" query to Google with the site directive for the newspapers and TV stations serving the 33101 zip code.

Figure 2: Google Chrome Extension, search in progress. Current search in image targets stories about Zika virus from Miami Times
After the search, the result (Figure 3) was saved remotely.
Figure 3: A subset (see complete) of the collection about Zika virus built for the Miami FL area.
Here are examples of other collections built with the Google Chrome Extension (Figures 4 and 5):
Figure 4: A subset (see complete) of the collection about Simone Biles' return for Houston Texas
Figure 5: A subset (see complete) of the collection about Protesters and Police for Norfolk Virginia
The Google Chrome extension also offers customized settings that suit different collection building needs:
Figure 6: Google Chrome Extension Settings (Part 1)
Figure 7: Google Chrome Extension Settings (Part 2)
  1. Google max pages: The number of Google search pages to visit for each news source. Increase if you want to explore more Google pages since the default value is 1 page.
  2. Google Page load delay (seconds): This time delay between loading Google search pages ensures a throttled request.
  3. Google Search FROM date: Filter your search to news articles crawled from this date onward. This comes in handy if a query spans multiple time periods but the curator is interested in a specific one.
  4. Google Search TO date: Filter your search to news articles before this date. This comes in handy especially when combined with 3: together they can be used to collect documents within a start and end time window.
  5. Archive Page load delay (seconds): Time delay between loading pages to be archived. You can increase this time if you want to have the chance to do something (such as hit archive again) before the next archived page loads automatically. This is tailored to archive.is.
  6. Download type: Download the collection to your machine (in JSON or TXT format) for a personal collection. But if you choose to share, save remotely (you should!)
  7. Collection filename: Custom filename for collection about to be saved.
  8. Collection name: Custom name for your collection. It's good practice to label collections.
  9. Upload a saved collection (.json): For json collections saved locally, you may upload them to revisualize the collection.
  10. Show Thumbnail: A flag that decides whether to send a remote request to get a card (thumbnail summary) for the link. Since cards require multiple GET requests, you may choose to switch this off if you have a large collection.
  11. Google news: The default search of the extension is the generic Google search page. Check this box to search the Google news vertical instead.
  12. Add website to existing collection: Add a website to an existing collection.
2. Local Memory Project - Geo service:

The Google Chrome extension utilizes the Geo service to find media sources that serve a zip code. This service is an implementation of Dr. Michael Nelson's idea for a service that supplies an ordered list of media outlets based on their proximity to a user-specified zip code.

Figure 8: List of top 10 Newspapers, Radio and TV station closest to zip code 23529 (Norfolk, VA)
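Under the hood, such a proximity ordering can be computed with the haversine great-circle distance. Here is a minimal sketch (an illustration, not necessarily the service's exact implementation), assuming outlet records carry the geo-coordinate fields shown in the API output in the next section:

# Order media outlets by great-circle (haversine) distance from a
# given latitude/longitude, e.g., the coordinates of a zip code.
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))  # Earth radius of ~3,956 miles

def nearest(outlets, lat, lon, k=10):
    key = lambda o: haversine_miles(lat, lon, o["cityCountyNameLat"], o["cityCountyNameLong"])
    return sorted(outlets, key=key)[:k]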

3. Local Memory Project - API:



The Local Memory Project Geo website is meant for human users, while the API website targets machine users. Therefore, it provides the same services as the Geo website but returns JSON output (as opposed to HTML). For example, below is a subset of the output (see complete) corresponding to a request for 10 news media sites in order of proximity to Cambridge, MA.
{
  "Lat": 42.379146, 
  "Long": -71.12803, 
  "city": "Cambridge", 
  "collection": [
    {
      "Facebook": "https://www.facebook.com/CambridgeChronicle", 
      "Twitter": "http://www.twitter.com/cambridgechron", 
      "Video": "http://www.youtube.com/user/cambchron", 
      "cityCountyName": "Cambridge", 
      "cityCountyNameLat": 42.379146, 
      "cityCountyNameLong": -71.12803, 
      "country": "USA", 
      "miles": 0.0, 
      "name": "Cambridge Chronicle", 
      "openSearch": [], 
      "rss": [], 
      "state": "MA", 
      "type": "Newspaper - cityCounty", 
      "website": "http://cambridge.wickedlocal.com/"
    }, 
    {
      "Facebook": "https://www.facebook.com/pages/WHRB-953FM/369941405267", 
      "Twitter": "http://www.twitter.com/WHRB", 
      "Video": "http://www.youtube.com/user/WHRBsportsFM", 
      "cityCountyName": "Cambridge", 
      "cityCountyNameLat": 42.379146, 
      "cityCountyNameLong": -71.12803, 
      "country": "USA", 
      "miles": 0.0, 
      "name": "WHRB 95.3 FM", 
      "openSearch": [], 
      "rss": [], 
      "state": "MA", 
      "type": "Radio - Harvard Radio", 
      "website": "http://www.whrb.org/"
    }, ...
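A machine client can consume this output directly. Below is a minimal sketch; the request URL is a placeholder (consult the API website for the actual request format), but the JSON shape matches the subset above:

# Fetch the API's JSON output and list outlets in order of proximity.
import json
from urllib.request import urlopen

API = "http://www.localmemory.org/api/..."  # placeholder request URL

with urlopen(API) as resp:
    data = json.load(resp)

for outlet in data["collection"]:
    print("%s (%s), %.1f miles" % (outlet["name"], outlet["type"], outlet["miles"]))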


Saving a collection built with the Google Chrome Extension

A collection built on a user's machine can be saved in one of two ways:
  1. Save locally: this serves as a way to keep a collection private. Saving can be done by clicking "Download collection" in the Generic settings section of the extension settings. A collection can be saved in json or plaintext format. The json format permits the collection to be reloaded through "upload a saved collection" in the Generic settings section of the extension settings. The plaintext format does not permit reloading into the extension, but contains all the links which make up the collection.
  2. Save remotely: in order to be able to share the collection you built locally with the world, you need to save remotely by clicking the "Save remotely" button on the frontpage of the application. This leads to a dialog requesting a mandatory unique collection author name (if one doesn't exist) and an optional collection name (Figure 10). After supplying the inputs the application saves the collection remotely and the user is presented with a link to the collection (Figure 11).
Before a collection is saved locally or remotely, you may choose to exclude an entire news source (all links from a given source) or a single link, as described by Figure 9:
Figure 9: Exclusion options before saving locally/remotely
Figure 10: Saving a collection prompts a dialog requesting a mandatory unique collection author name and an optional collection name
Figure 11: A link is presented after a collection is saved remotely

Archiving a collection built with the Google Chrome Extension



Saving is the first step to make a collection persist after it is built. However, archiving ensures that the links referenced in a collection persist even if the content is moved or deleted. Our application currently integrates archiving via Archive.is, but we plan to expand the archiving capability to include other public web archives.

In order to archive your collection, click the "Archive collection" button on the frontpage of the application. This leads to a dialog similar to the saving dialog, which requests a mandatory unique collection author name (if one doesn't exist) and an optional collection name. Subsequently, the application archives the collection by first archiving the front page, which contains all the local news sources, and then archiving the individual links which make up the collection (Figure 12). You may stop the archiving operation at any time by clicking "Stop" on the orange archiving-status message bar. At the end of the archiving process, you get a short URI corresponding to the archived collection (Figure 13).
Figure 12: Archiving in progress
Figure 13: When the archiving is complete, a short link pointing to the archived collection is presented

Community collection building with the Google Chrome Extension


We envision a community of users contributing to a single collection for a story. Even though collections are built in isolation, we consider a situation in which we can group collections around a single theme. To begin this process, the Google Chrome Extension lets you share a locally built collection on Twitter by clicking the "Tweet" button (Figure 14).
Figure 14: Tweet button enables sharing the collection

This means that if user 1 and user 2 locally build collections for Hurricane Hermine, they may both use the hashtags #localmemory and #hurricanehermine when sharing their collections. Consequently, all Hurricane Hermine-related collections can be found on Twitter via those hashtags. We encourage users to include #localmemory and the collection hashtags in tweets when sharing collections. We also encourage you to follow the Local Memory Project on Twitter.
The local news media is a vital organ of journalism, but one in decline. We hope that by providing free and open-source tools for collection building, we can contribute in some capacity to its revival.

I am thankful to everyone who has contributed to the ongoing success of this project: Adam, Anastasia, Matt, Jack, and the rest of the Harvard LIL team; my supervisor Dr. Nelson and Dr. Weigle; Christie Moffat at the National Library of Medicine; as well as Sawood, Mat, and the rest of my colleagues at WSDL. Thank you.
--Nwala

Monday, November 7, 2016

2016-11-07: Linking to Persistent Identifiers with rel="identifier"

Do you remember hearing about that study that found that people who are "good" at swearing actually have a large vocabulary, refuting the conventional wisdom about a "poverty-of-vocabulary"?  The DOI (digital object identifier) for the 2015 study is*:

http://dx.doi.org/10.1016/j.langsci.2014.12.003

But if you read about it in the popular press, such as the Independent or US News & World Report, you'll see that they linked to:

http://www.sciencedirect.com/science/article/pii/S038800011400151X

The problem is that although the DOI is the preferred link, browsers follow a series of redirects from the DOI to the ScienceDirect link, which is then displayed in the address bar of the browser, and that's the URI that most people are going to copy and paste when linking to the page.  Here's a curl session showing just the HTTP status codes and corresponding Location: headers for the redirection:

$ curl -iL --silent http://dx.doi.org/10.1016/j.langsci.2014.12.003 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 303 See Other
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectSinglePerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS038800011400151X%3Fvia%253Dihub&key=072c950bffe98b3883e1fa0935fb56a6f1a1b364
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S038800011400151X
HTTP/1.1 200 OK


Most publishers follow this model of a series of redirects to implement authentication, tracking, etc. While DOI use has made significant progress in scholarly literature, many times the final URL is the one that is linked to instead of the more stable DOI (see the study by Herbert, Martin, and Shawn presented at WWW 2016 for more information).  Furthermore, while sometimes the mapping between the final URL and DOI is obvious (e.g., http://dx.doi.org/10.1371/journal.pone.0115253 --> http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253), the above example proves that's not always the case.

Ad-hoc linking back to DOIs

One of the obstacles limiting the correct linking is that there is no standard, machine-readable method for the HTML from the final URI to link back to its DOI (and by "DOI" we also mean all other persistent identifiers, such as handles, purls, arks, etc.).  In practice, each publisher adopts its own strategy for specifying DOIs in <meta> HTML elements:

In http://link.springer.com/article/10.1007%2Fs00799-016-0184-4 we see:

<meta name="citation_publisher" content="Springer Berlin Heidelberg"/>
<meta name="citation_title" content="Web archive profiling through CDX summarization"/>
<meta name="citation_doi" content="10.1007/s00799-016-0184-4"/>
<meta name="citation_language" content="en"/>
<meta name="citation_abstract_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_fulltext_html_url" content="http://link.springer.com/article/10.1007/s00799-016-0184-4"/>
<meta name="citation_pdf_url" content="http://link.springer.com/content/pdf/10.1007%2Fs00799-016-0184-4.pdf"/>


In http://www.dlib.org/dlib/january16/brunelle/01brunelle.html we see:

<meta charset="utf-8" />
<meta id="DOI" content="10.1045/january2016-brunelle" />
<meta itemprop="datePublished" content="2016-01-16" />
<meta id="description" content="D-Lib Magazine" />


In http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253 we see:

<meta name="citation_doi" content="10.1371/journal.pone.0115253" />
...
<meta name="dc.identifier" content="10.1371/journal.pone.0115253" />


In https://www.computer.org/csdl/proceedings/jcdl/2014/5569/00/06970187-abs.html we see:

<meta name='doi' content='10.1109/JCDL.2014.6970187' />

And in http://ieeexplore.ieee.org/document/754918/ there are no HTML elements specifying the corresponding DOI at all.  Furthermore, <meta> HTML elements can only appear in HTML -- which means you can't provide Links for PDF, CSV, Zip, or other non-HTML representations.  For example, NASA uses handles as the persistent identifiers for the PDF versions of their reports:

$ curl -IL http://hdl.handle.net/2060/19940023070
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Location: http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Expires: Thu, 03 Nov 2016 17:47:07 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 221
Date: Thu, 03 Nov 2016 17:47:07 GMT

HTTP/1.1 301 Moved Permanently
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Location: https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT
Content-Length: 984250
Content-Type: application/pdf


And the final PDF obviously cannot use HTML elements to link back to its handle.
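To make the fragility of these ad-hoc conventions concrete, here is a minimal scraping sketch (ours, for illustration, not any publisher's code) that must enumerate every attribute-name convention shown above, and would still come up empty for the IEEE page and for non-HTML resources like NASA's PDFs:

# Try the <meta> conventions observed above (citation_doi,
# dc.identifier, DOI, doi) and return the first match, if any.
import re
from urllib.request import urlopen

META_RE = re.compile(
    r'<meta\s+(?:name|id)=["\'](?:citation_doi|dc\.identifier|doi)["\']'
    r'\s+content=["\']([^"\']+)["\']',
    re.IGNORECASE)

def find_doi(url):
    html = urlopen(url).read().decode("utf-8", "replace")
    m = META_RE.search(html)
    return m.group(1) if m else None  # None: no known convention matched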

To address these shortcomings, and in support of our larger vision of Signposting the Scholarly Web, we are proposing a new IANA link relation type, rel="identifier", that will support linking from the final URL in the redirection chain (a.k.a. the "locating URI") back to the persistent identifier that ideally one would use to start the resolution.  For example, in the NASA example above, the PDF would link back to its handle with the proposed Link header in red:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier"
Content-Length: 984250
Content-Type: application/pdf
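No server emits rel="identifier" yet, since it is only a proposal, but consuming it would be trivial. A minimal client sketch in Python (the requests library already parses Link response headers into r.links), which would print the handle once a header like the one above is deployed:

# Discover a resource's persistent identifier from the proposed
# rel="identifier" Link header; no HTML parsing required.
import requests

def persistent_identifier(url):
    r = requests.head(url, allow_redirects=True)
    link = r.links.get("identifier")  # r.links is keyed by rel value
    return link["url"] if link else None

print(persistent_identifier(
    "http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf"))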


And in the Language Sciences example that we began with, the final HTTP response (which returns the HTML landing page) would use the Link header like this:

HTTP/1.1 200 OK
Last-Modified: Fri, 04 Nov 2016 00:36:50 GMT
Content-Type: text/html
X-TransKey: 11/03/2016 20:36:50 EDT#2847_006#2415#68.228.137.112
X-RE-PROXY-CMP: 1
X-Cnection: close
X-RE-Ref: 0 1478219810005195
Server: www.sciencedirect.com
P3P: CP="IDC DSP LAW ADM DEV TAI PSA PSD IVA IVD CON HIS TEL OUR DEL SAM OTR IND OTC"
Vary: Accept-Encoding, User-Agent
Expires: Fri, 04 Nov 2016 00:36:50 GMT
Cache-Control: max-age=0, no-cache, no-store

Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"
...

But it's not just the landing page that would link back to the DOI; the constituent resources that are part of a DOI-identified object would as well.  Below is a request and response for the PDF file in the Language Sciences example, which carries the same Link: response header as the landing page:

$ curl -IL --silent "http://ac.els-cdn.com/S038800011400151X/1-s2.0-S038800011400151X-main.pdf?_tid=338820f0-a442-11e6-9f85-00000aab0f6b&acdnat=1478451672_5338d66f1f3bb88219cd780bc046bedf"
HTTP/1.1 200 OK
Accept-Ranges: bytes
Allow: GET
Content-Type: application/pdf
ETag: "047508b07a69416a9472c3ac02c5a9a01"
Last-Modified: Thu, 15 Oct 2015 08:11:25 GMT
Server: Apache-Coyote/1.1
X-ELS-Authentication: SDAKAMAI
X-ELS-ReqId: 67961728-708b-4cbb-af64-bb68f1da03ea
X-ELS-ResourceVersion: V1
X-ELS-ServerId: ip-10-93-46-150.els.vpc.local_CloudAttachmentRetrieval_prod
X-ELS-SIZE: 417655
X-ELS-Status: OK
Content-Length: 417655
Expires: Sun, 06 Nov 2016 16:59:44 GMT
Cache-Control: max-age=0, no-cache, no-store
Pragma: no-cache
Date: Sun, 06 Nov 2016 16:59:44 GMT
Connection: keep-alive

Link: <http://dx.doi.org/10.1016/j.langsci.2014.12.003>; rel="identifier"

Although at first glance there seem to be a number of existing rel types (some registered, some not) that would be suitable:
  • rel="canonical"
  • rel="alternate" 
  • rel="duplicate" 
  • rel="related"
  • rel="bookmark"
  • rel="permalink"
  • rel="shortlink"
It turns out they all do something different.  Below we explain why these rel types are not suitable for linking to persistent identifiers.
rel="canonical" 

This would seem to be a likely candidate and it is widely used, but it actually exists for a different purpose: to "identify content that is either duplicative or a superset of the content at the context (referring) IRI." Quoting from RFC 6596:
If the preferred version of a IRI and its content exists at:

http://www.example.com/page.php?item=purse

Then duplicate content IRIs such as:

http://www.example.com/page.php?item=purse&category=bags
http://www.example.com/page.php?item=purse&category=bags&sid=1234

may designate the canonical link relation in HTML as specified in
[REC-html401-19991224]:

<link rel="canonical"
      href="http://www.example.com/page.php?item=purse">
In the representative cases shown above, the DOI, handle, etc. is neither duplicative nor a superset of the content.  For example, the URI of the NASA report PDF clearly bears some relation to its handle, but the PDF URI is neither duplicative of, nor a superset of, the handle.  This is reinforced by the semantics of the "303 See Other" redirection, which indicates there are two different resources with two different URIs**.  rel="canonical" is ultimately about establishing primacy among the (possibly) many URI aliases for a single resource.  For SEO purposes, this avoids splitting PageRank.

Furthermore, publishers like Springer are already using rel="canonical" (highlighted in red) to specify a preferred URI in their chain of redirects:

$ curl -IL http://dx.doi.org/10.1007/978-3-319-43997-6_35
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://link.springer.com/10.1007/978-3-319-43997-6_35
Expires: Mon, 31 Oct 2016 20:52:26 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 191
Date: Mon, 31 Oct 2016 20:40:48 GMT

HTTP/1.1 302 Moved Temporarily
Content-Type: text/html; charset=UTF-8
Location: http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35
Server: Jetty(9.2.14.v20151106)
X-Environment: live
X-Origin-Server: 19t9ulj5bca
X-Vcap-Request-Id: 48d17c7e-2556-4cff-4b2b-0e6fbae94237
Content-Length: 0
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448562:07a49aef;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=d9cf189bedb640a9b5d55c9d0;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Link: <http://link.springer.com/chapter/10.1007%2F978-3-319-43997-6_35>; rel="canonical"
Server: openresty
X-Environment: live
X-Origin-Server: 19ta3iq6v47
X-Served-By: core-internal.live.cf.private.springer.com
X-Ua-Compatible: IE=Edge,chrome=1
X-Vcap-Request-Id: 5a458b2c-de85-42cd-7157-022c440a9668
X-Vcap-Request-Id: 54b0e2dc-7766-4c00-4f95-d33bdb6c427a
Cache-Control: max-age=0
Expires: Mon, 31 Oct 2016 20:40:48 GMT
Date: Mon, 31 Oct 2016 20:40:48 GMT
Connection: keep-alive
Set-Cookie: sim-inst-token=1:3000168670-3000176756-3001080530-8200972180:1477976448766:c35e0847;Path=/;Domain=.springer.com;HttpOnly
Set-Cookie: trackid=1d67fdfb47ab4a5f94b43326e;Path=/;Domain=.springer.com;HttpOnly
X-Robots-Tag: noarchive

 
And some publishers use it inconsistently.  In this Elsevier example, the content from http://dx.doi.org/10.1016/j.acra.2015.10.004 is indexed at three different URIs:


Even if we accept that the PubMed version is a different resource (i.e., hosted at NLM instead of Elsevier) and should have a separate URI, Elsevier still maintains two different URIs for this article:

http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract
http://www.sciencedirect.com/science/article/pii/S1076633215004535

The DOI resolves to the former URI (academicradiology.org), but it is the latter (sciencedirect.com) that carries this link in the HTML (and not in the HTTP response header):

<link rel="canonical" href="http://www.sciencedirect.com/science/article/pii/S1076633215004535">

Presumably to distinguish this URI from the various URIs that you get starting with http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 instead of the DOI:

$ curl -iL --silent http://linkinghub.elsevier.com/retrieve/pii/S1076633215004535 | egrep -i "(HTTP/1.1|^location:)"
HTTP/1.1 301 Moved Permanently
location: /retrieve/articleSelectPrefsPerm?Redirect=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%2Farticle%2Fpii%2FS1076633215004535%3Fvia%253Dihub&key=07077ac16f0a77a870586ac94ad3c000cfa1973f
HTTP/1.1 301 Moved Permanently
location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535?via%3Dihub&ccp=y
HTTP/1.1 301 Moved Permanently
Location: http://www.sciencedirect.com/science/article/pii/S1076633215004535
HTTP/1.1 200 OK


In summary, although "canonical" seems promising at first, the semantics are different from what we propose and publishers are already using it for internal linking purposes.  This eliminates "canonical" from consideration. 

rel="alternate" 

This rel type has been around for a while and has some reserved historical definitions for stylesheets and RSS/Atom, but the general semantics of "alternate" is to provide "an alternate representation of the current document."  In practice, this means surfacing different representations of the same resource that vary in Content-Type (e.g., application/pdf vs. text/html) and/or Content-Language (e.g., en vs. fr).  Since a DOI, for example, is not simply a different representation of the same resource, "alternate" is removed from consideration.

rel="duplicate" 

RFC 6249 specifies how a resource can declare that resources with different URIs are in fact byte-for-byte equivalent.  "duplicate" might be suitable for stating equivalence between the PDFs linked at both http://www.academicradiology.org/article/S1076-6332(15)00453-5/abstract and http://www.sciencedirect.com/science/article/pii/S1076633215004535, but we can't use it to link back to http://dx.doi.org/10.1016/j.acra.2015.10.004.

rel="related

Defined in RFC 4287, "related" is probably the closest to what we propose, but its semantics are purposefully vague.  A DOI is certainly related to its locating URI, but it is also related to a lot of other resources: the other articles in a journal issue, other publications by the authors, citing articles, etc.  Using "related" to link to DOIs would be ambiguous, and would eventually lead to parsing the linked URI for strings like "dx.doi.org", "handle.net", etc. -- not what we want to encourage.

rel="bookmark" 

We initially hoped this could mean "when you press <control-D>, use this URI instead of the one in your address bar."  Unfortunately, "bookmark" is instead used to identify permalinks for different sections of the document in which it appears.  As a result, it is not even defined for Link: HTTP headers, and is thus eliminated from consideration.

rel="permalink" 

It turns out that "permalink" was intended for what we thought "bookmark" would be used for, but although it was proposed, it was never registered nor did it gain significant traction ("bookmark" was used instead).  It is most closely associated with the historical problem of creating deep links within blogs and as such we choose not to resurrect it for persistent identifiers.

rel="shortlink" 

We include this one mostly for completeness, since its semantics arguably provide the opposite of what we want: instead of a link to a persistent identifier, it allows linking to a shortened URI.  Despite its widespread use, it is actually not registered.

The ecosystem around persistent identifiers is fundamentally different from that of shortened URIs, even though the two may look similar to the untrained eye.  Putting aside the preservation nightmare scenario of bit.ly going out of business or Twitter deprecating t.co, "shortlink" could be used to complement "identifier".  Revisiting the NASA example from above, the two rel types could be combined to link to both the handle and the nasa.gov-branded shortened URI:

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
      <http://go.nasa.gov/2fkvyya>; rel="shortlink" 
Content-Length: 984250
Content-Type: application/pdf


Combining rel="identifier" with other Links

The "shortlink" example above illustrates that "identifier" can be combined with other rel type for more expressive resources.  Here we extend the NASA example further with rel="self":

HTTP/1.1 200 OK
Date: Thu, 03 Nov 2016 17:47:08 GMT
Server: Apache
Set-Cookie: JSESSIONID=C88324CAB3C27D6D8152C9BE3B322095; Path=/; Secure
Accept-Ranges: bytes
Last-Modified: Fri, 30 Aug 2013 19:15:59 GMT

Link: <http://hdl.handle.net/2060/19940023070>; rel="identifier",
 <http://go.nasa.gov/2fkvyya>; rel="shortlink",
 <http://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/19940023070.pdf>; rel="self"
Content-Length: 984250
Content-Type: application/pdf


Now the HTTP response for the PDF is self-contained and unambiguously lists all of the appropriate URIs.  We could also combine rel="identifier" with version information.  arXiv.org does not issue DOIs or handles, but it does mint its own persistent identifiers.  Here we propose, using rel types from RFC 5829, how version 1 of an eprint could link both to version 2 (which is both the next and the current version) and to the persistent identifier (which we also know to be the "latest-version"):

$ curl -I https://arxiv.org/abs/1212.6177v1
HTTP/1.1 200 OK
Date: Fri, 04 Nov 2016 02:31:19 GMT
Server: Apache
ETag: "Tue, 08 Jan 2013 01:02:17 GMT"
Expires: Sat, 05 Nov 2016 00:00:00 GMT
Strict-Transport-Security: max-age=31536000
Set-Cookie: browser=68.228.137.112.1478226679112962; path=/; max-age=946080000; domain=.arxiv.org
Last-Modified: Tue, 08 Jan 2013 01:02:17 GMT

Link: <https://arxiv.org/abs/1212.6177>; rel="identifier latest-version",
      <https://arxiv.org/abs/1212.6177v2>; rel="successor-version",
      <https://arxiv.org/abs/1212.6177v1>; rel="self" 
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=utf-8


The Signposting web site has further examples of how rel="identifier" can be used to express the relationship between the persistent identifiers, the "landing page", the "publication resources" (e.g., the PDF, PPT), and the combination of both the landing page and publication resources.  We encourage you to explore the analyses of existing publishers (e.g., Nature) and repository systems (e.g., DSpace, EPrints).

In summary, we propose rel="identifier" to standardize linking to DOIs, handles, and other persistent identifiers.  HTML <meta> tags can't be used as headers in HTTP responses, and existing rel types such as "canonical" and "bookmark" have different semantics.

We welcome feedback about this proposal, which we intend to eventually standardize with an RFC and register with IANA. Herbert will cover these issues at PIDapalooza, and we will include the slides here after the conference.

2016-11-10 Edit: Herbert's PIDapalooza slides are now available:




--Michael & Herbert

* 2017-08-04 edit: Strictly speaking, a DOI by itself is not actually a URI (i.e., "doi:" is not a registered scheme with IANA) and there are various ways to turn DOIs into HTTP URIs (useful for dereferencing on the web) or info URIs (useful for when dereferencing is not desired).  Without loss of generality, web-based discussions typically assume the promotion of DOIs to HTTP URIs.  Common forms use the resolvers run by CNRI; historically this meant dx.doi.org:

$ curl -I http://dx.doi.org/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:52:21 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:21:50 GMT


I think just doi.org is now preferred:

$ curl -I http://doi.org/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:45:44 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:21:54 GMT
 

Even the following is possible (but not preferred):

$ curl -I http://hdl.handle.net/10.1016/j.langsci.2014.12.003
HTTP/1.1 303 See Other
Server: Apache-Coyote/1.1
Vary: Accept
Location: http://linkinghub.elsevier.com/retrieve/pii/S038800011400151X
Expires: Fri, 04 Aug 2017 19:51:53 GMT
Link: <https://api.elsevier.com/content/usage/doi/>; rel="dul"
Content-Type: text/html;charset=utf-8
Content-Length: 207
Date: Fri, 04 Aug 2017 19:22:04 GMT



** Technically, a DOI is a "digital identifier of an object" rather than "identifier of a digital object", and thus there is not a representation associated with the resource identified by a DOI (i.e., not an information resource).  Relationships like "canonical", "alternate", etc. only apply to information resources, and thus are not applicable to most persistent identifiers.  Interested readers are encouraged to further explore the HTTPRange-14 issue.

Saturday, November 5, 2016

2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

Internet Archive & Libyan newspapers logos
Colonel Gaddafi ruled Libya for 42 years after taking power from King Idris in a 1969 military coup. In August 2011, his regime was toppled in the so-called Arab Spring. For more than four decades, media in Libya was highly politicized to support Gaddafi's regime and secure his power. After the Libyan revolution in 2011, the media were freed from the tight control of the government, and we have seen the establishment of tens if not hundreds of new media organizations. Here is an overview of one arm of Gaddafi's propaganda machine, the newspapers:
  • 71 newspapers and magazines 
  • All monitored and published by the Libyan General Press Corporation (LGPC) 
  • The Jamahiriya News Agency (JANA) was the main source of domestic news 
  • No real political function other than to polish the regime’s image 
  • Published information provided by the regime 
The following are the most well-known Libyan newspapers, all of which are published by the LGPC:



All Libyan newspaper websites are no longer controlled by the government

After the revolution, most of the Libyan newspapers' websites, including the website of the Libyan General Press Corporation (LGPC), came under the control of foreign institutions, in particular an Egyptian company. Al Jamahiriya (www.aljamahiria.com/), El Shams (alshames.com), and El Fajr El Jadid (www.alfajraljadeed.com/) became Egyptian news websites under different names: Jime News (www.news.aljamahiria.com/), Kifah Arabi (www.news.kifaharabi.com/), and El Fajr El Jadid Alakbaria, while El Zahf Al Akhdar (www.azzahfalakhder.com/) is now a German sports blog. Here are the logos of the new websites (the new websites retain the same domain names, except alshames.com, which redirects to www.news.kifaharabi.com/):


Can we still have access to the old state media?
After this big change in Libya with the fall of the regime, can we still have access to the old state media? (This question might apply to other countries as well: would any political or regime change in a country lead to the loss of part of its digital history?)
Fortunately, the Internet Archive has captured thousands of snapshots of the Libyan newspapers' websites. The main pages of Al Jamahiriya (www.aljamahiria.com/), El Shams (alshames.com), El Zahf Al Akhdar (www.azzahfalakhder.com/), and El Fajr El Jadid (www.alfajraljadeed.com/) have been captured 2,310, 606, 1,398, and 836 times, respectively, by the Internet Archive.

www.aljamahiria.com/ captured 2,310 times by the Internet Archive
www.azzahfalakhder.com/ captured 1,398 times by the Internet Archive

Praise for Gaddafi no longer on the live web
Although we cannot conclude that the Internet Archive has captured everything (the content in these newspapers was extremely redundant, as they focused on praising the regime), it has captured important events, such as the regime's activities during the 2011 revolution, a lot of domestic news and the regime's interpretation of international news, many economic articles, the long process undertaken by Libyan authorities to establish the African Union, Gaddafi's speeches, etc. Below is an example of one of these articles, from the 2011 Libyan revolution, declaring that "there will be no future for Libya without our leader Gaddafi". This article is no longer available on the live web.
From the Internet Archive https://web.archive.org/web/20

Slides about this post are also available:
--Mohamed Aturban