
Wednesday, February 22, 2017

2017-02-22: Archive Now (archivenow): A Python Library to Integrate On-Demand Archives

Examples: Archive Now (archivenow) CLI
A small part of my research is ensuring that certain web pages are preserved in public web archives so that they will be available and retrievable whenever needed in the future. Because archivists believe that "lots of copies keep stuff safe", I have created a Python library (Archive Now) to push web resources into several on-demand archives, such as the Internet Archive, WebCite, Perma.cc, and Archive.is. If, for any reason, one archive stops serving content temporarily or permanently, copies can likely still be fetched from the other archives. With Archive Now, one command like:
   
$ archivenow --all www.cnn.com

is sufficient for the current CNN homepage to be captured and preserved by all archives configured in this Python library.

Archive Now allows you to accomplish the following major tasks:
  • Push a web page into a single archive
  • Push a web page into multiple archives
  • Push a web page into all configured archives
  • Add new archives
  • Remove existing archives
Install Archive Now from PyPI:
    $ pip install archivenow

To install from the source code:
    $ git clone git@github.com:oduwsdl/archivenow.git
    $ cd archivenow
    $ pip install -r requirements.txt
    $ pip install ./


"pip", "archivenow", and "docker" may require "sudo"

Archive Now can be used through:

   1. The CLI

Usage information for archivenow can be accessed by providing the -h or --help flag:
   $ archivenow -h
   usage: archivenow [-h][--cc][--cc_api_key [CC_API_KEY]]
                     [--ia][--is][--wc][-v][--all][--server]
                     [--host [HOST]][--port [PORT]][URI]
   positional arguments:
     URI                   URI of a web resource
   optional arguments:
     -h, --help            show this help message and exit
     --cc                  Use The Perma.cc Archive
     --cc_api_key [CC_API_KEY]
                           An API KEY is required by The Perma.cc Archive
     --ia                  Use The Internet Archive
     --is                  Use The Archive.is
     --wc                  Use The WebCite Archive
     -v, --version         Report the version of archivenow
     --all                 Use all possible archives
     --server              Run archiveNow as a Web Service
     --host [HOST]         A server address
     --port [PORT]         A port number to run a Web Service


Examples:
   
To archive the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com


By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com


To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and The Archive.is:


$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc


To save the web page (www.foxnews.com) in all configured web archives:


$ archivenow --all www.foxnews.com --cc_api_key $Your-Perma-CC-API-Key
https://perma.cc/8YYC-C7RM
https://web.archive.org/web/20170220074919/http://www.foxnews.com
http://archive.is/jy8B0
http://www.webcitation.org/6o9IKD9FP

Run it as a Docker container (you need to run "docker pull" first):

$ docker pull maturban/archivenow

$ docker run -it --rm maturban/archivenow -h
$ docker run -p 80:12345 -it --rm maturban/archivenow --server
$ docker run -p 80:11111 -it --rm maturban/archivenow --server --port 11111
$ docker run -it --rm maturban/archivenow --ia http://www.cnn.com
...


   2. A Web Service

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 11111).

$ archivenow --server
  * Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://127.0.0.1:12345/ia/www.foxnews.com

     HTTP/1.0 200 OK
     Content-Type: application/json
     Content-Length: 95
     Server: Werkzeug/0.11.15 Python/2.7.10
     Date: Thu, 09 Feb 2017 14:29:23 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209142922/http://www.foxnews.com"
      ]
    }


To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 172
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Thu, 09 Feb 2017 14:33:47 GMT

    {
      "results": [
        "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
        "http://archive.is/H2Yfg",
        "http://www.webcitation.org/6o9Jubykh",
        "Error (The Perma.cc Archive): An API KEY is required"
      ]
    }


You may pass the Perma.cc API key as follows:

$ curl -i http://127.0.0.1:12345/all/www.foxnews.com?cc_api_key=$Your-Perma-CC-API-Key


   3. Python Usage

>>> from archivenow import archivenow

To save the web page (www.foxnews.com) in The WebCite Archive:

>>> archivenow.push("www.foxnews.com","wc")
['http://www.webcitation.org/6o9LTiDz3']


To save the web page (www.foxnews.com) in all configured archives:


>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']


To save the web page (www.foxnews.com) in Perma.cc:

>>> archivenow.push("www.foxnews.com","cc","cc_api_key=$Your-Perma-cc-API-KEY")
['https://perma.cc/8YYC-C7RM']


To start the server from Python, do the following. The host and port number can be passed as arguments (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

* Running on http://127.0.0.1:12345/ (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Adding a new archive is as simple as adding a handler file in the folder "handlers". For example, if I want to add a new archive named "My Archive", I would create a file "ma_handler.py" and store it in the folder "handlers". The "ma" will be the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive through the Python code, I should write ">>> archivenow.push("www.cnn.com","ma")". In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push", which has one argument. It might be helpful to see how the other "*_handler.py" files are organized.
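To illustrate, here is a minimal sketch of what "ma_handler.py" might look like. The class name, the "push" function, and the "enabled" variable follow the description above, but the exact signature, return value, and placement of "enabled" should be confirmed against the existing "*_handler.py" files; the archive URL below is purely hypothetical.

# handlers/ma_handler.py -- an illustrative sketch only, not taken from the
# archivenow source; see the existing *_handler.py files for the exact
# signature and return value the library expects.
class MA_handler(object):
    # Set to False to disable this archive (the real handlers may define
    # "enabled" differently; check the existing handlers).
    enabled = True

    def push(self, uri):
        # Submit the URI to "My Archive" and return the resulting memento URI.
        # A real handler would call the archive's API here; this placeholder
        # just builds a hypothetical URI.
        return "http://myarchive.example.org/" + uri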

Removing an archive can be done in one of the following ways:
  • Remove the archive handler file from the folder "handlers"
  • Rename the archive handler file so that it does not end with "_handler.py"
  • Set the variable "enabled" to "False" inside the handler file

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the 'same' resource. For example, if you send a request to the IA to capture (www.cnn.com) at 10:00pm, the IA will create a new memento (let's call it M1) of the CNN homepage. The IA will then return M1 for all requests to archive the CNN homepage received before 10:02pm. The Archive.is sets this time gap to five minutes.
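To illustrate with the Python interface shown above, pushing the same URI to the IA twice in quick succession should yield the same memento rather than two different captures (the exact behavior depends on the archive, so treat this as an expectation, not a guarantee):

from archivenow import archivenow

# Both calls are made well within the IA's two-minute gap, so the second push
# is expected to return the existing memento (M1) rather than create a new one.
first = archivenow.push("www.cnn.com", "ia")
second = archivenow.push("www.cnn.com", "ia")
print(first == second)  # likely True while the time gap is in effect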

Updates and pull requests are welcome: https://github.com/oduwsdl/archivenow

--Mohamed Aturban


2017-11-13 Edit:

We added a new UI page to "archivenow". It will show up when running "archivenow" as a web service. This page allows users to submit a URL to selected archives as shown below:



You can install and run "archivenow" as a web service using the "pip" command or a Docker container (you can change the default IP and/or port number):

As a Docker container: 

$ docker run -p 22222:11111 -it --rm maturban/archivenow --server --port 11111 --host 0.0.0.0

Using pip:

$ pip install archivenow 
$ archivenow --server 


Friday, September 9, 2016

2016-09-09: Summer Fellowship at the Harvard Library Innovation Lab Trip Report


Myself standing at the main entrance of Langdell Hall
I was honored with the great opportunity of collaborating with the Harvard Library Innovation Lab (LIL) as a Fellow this Summer. Located at Langdell Hall, Harvard Law School, the Library Innovation Lab develops solutions to serious problems facing libraries. It consists of an eclectic group of lawyers, librarians, and software developers engaged in projects such as Perma.cc, the Caselaw Access Project (CAP), and The Nuremberg Project, among many others.
The LIL Team
To help prevent link rot, Perma.cc creates permanent reliable links for web resources. The Caselaw Access Project is an ambitious project which strives to make all US case laws freely accessible online. The current collection to be digitized stands at over 42,000 volumes (nearly 40 million pages). The Nuremberg Project is concerned with the digitization of LIL's collection about the Nuremberg trials. 
I started work on June 6, 2016 (through August 24) as one of seven Summer Fellows, and was supervised by Adam Ziegler, LIL’s Managing Director. During the first week of the fellowship, we (Summer Fellows) were given a tour around the Harvard Law School Library and had the opportunity to share our research plans in the first Fellows hour - a session in which Fellows reported research progress, and received feedback from the LIL team as well as other Fellows. The Fellowship program was structured such that we had the flexibility to research subjects that interested us.
Harvard LIL 2016 Summer Fellows (See LIL's blog)
1. Neel Agrawal: Neel is a Law Librarian at the LA Law Library in Los Angeles, California. He is also a professional percussionist in various musical contexts such as fusion, Indian classical, and Western classical. He spent the Summer researching African drumming laws to understand why and how colonial governments controlled, criminalized, and regulated drumming in Western/Northern Nigeria, Ghana, Uganda, Malawi, The Gambia, and Seychelles.
2. Jay Edwards: Jay was the lead database engineer for Obama for America in 2012 and also the ninth employee at Twitter. He spent the Summer working on the Caselaw Access Project, building a platform to enable non-programmers to use Caselaw data.
3. Sara Frug: Sara is the Associate Director of the Cornell Law School Legal Information Institute, where she manages the engineering team which designs various tools that improve the accessibility and usability of legal text. Sara spent the Summer further researching how to improve the accessibility of legal text by developing a legal text data model.
4. Ilya Kreymer: Ilya is the creator of Webrecorder and oldweb.today. Webrecorder is an interactive archiving tool which helps users create high-fidelity web archives of websites by simply browsing through the tool. Ilya spent the Summer improving Webrecorder.
5. Muira McCammon: Muira just concluded her M.A. in Comparative Literature/Translation Studies at the University of Massachusetts-Amherst and received her B.A. in International Relations and French from Carleton College. Her M.A. thesis was about the history of the Guantanamo Bay Detainee Library. She spent the Summer further expanding her GiTMO research by drafting a narrative nonfiction book, designing a tabletop wargame to model the interaction dynamics of various parties at GiTMO, and organizing a GiTMO conference.
6. Alexander Nwala: I am a computer science Ph.D. student at Old Dominion University under the supervision of Dr. Michael Nelson. I have worked on projects such as Carbon date, What did it look like?, and I Can Haz Memento. Carbon date helps you estimate the birth date of a website, and What did it look like? renders an animated GIF which shows how a website changed over time. I spent the Summer expanding my current research, which is concerned with building collections for stories and events.
7. Tiffany Tseng: Tiffany is the creator of Spin and a Ph.D. graduate of the Lifelong Kindergarten group at the MIT Media Lab. Spin is a photography turntable system used for capturing animations of the evolution of design projects. Her research at MIT primarily focused on supporting designers and makers in documenting and sharing their design process. Tiffany also has comprehensive knowledge of a wide range of snacks.
Interesting things happen when you bring together a group of scholars from different fields with different interests. We constantly had the opportunity to learn about one another's research from the different perspectives offered by the Fellows and the LIL team. Progress was constant, as were scrum and button making.
A few buttons assembled during one of the many button making rituals at LIL
The 2016 LIL Summer Fellowship concluded with a Fellows share event in which the seven Summer Fellows presented the outcome of their work during the Fellowship.


During the presentation, Neel talked about his interactive African drumming laws website.

A paid permit was required by law in order to drum in the Western Nigeria District Councils
The website provides an online education experience by tracing the creation of about 100 drumming laws between the 1950s and 1970s in District Councils throughout Western Nigeria.

88 CPU Cores processing the CAP XML data
Jay talked about the steps he took in order to make the dense XML Caselaw data searchable: first, he validated the Caselaw XML files; second, he converted the files to a columnar data store format (Parquet); third, he loaded the preprocessed Caselaw data into Apache Drill in order to provide query capabilities.

Examples of different classification systems of legal text: Eurovoc (left), Library of Congress Subject Headings (center), and Legislative Indexing Vocabulary (right)
Sara talked about a general data model she developed which enables developers to harness information available in different legal text classification systems, without having to understand the specific details of each system. 

Ilya demonstrated the new capabilities in the new version of Webrecorder.
Muira talked about her investigation of GiTMO and other detainee libraries. She highlighted her work with the Harvard Law School Library to create a Twitter archive of the tweets of Carol Rosenberg (a Miami Herald journalist). She also talked about her experiences in filing Freedom of Information Act (FOIA) requests.
I presented the Geo derivative of the Local Memory Project, which maps zip codes to local news media outlets. I also presented a non-public prototype of the Local Memory Project Google Chrome extension. The extension helps users build, archive, and share collections about local events or stories collected from local news media outlets.

Tiffany's work at Hatch Makerspace: Spin setup (left), PIx documentation station (center), and PIx whiteboard for sharing projects (right)
The presentations concluded with Tiffany's talk about her collaboration with HATCH - a makerspace run by Watertown Public Library. She also talked about her work improving Spin (a turntable system she created).

I will link to the Fellows share video presentation and booklet when LIL posts them.

-- Nwala (@acnwala)

Monday, July 14, 2014

2014-07-14: "Refresh" For Zombies, Time Jumps

We've blogged before about "zombies", or archived pages that reach out to the live web for images, ads, movies, etc.  You can also describe it as the live web "leaking" into the archive, but we prefer the more colorful metaphor of a mixture of undead and living pages.  Most of the time JavaScript is to blame (for example, see our TPDL 2013 paper "On the Change in Archivability of Websites Over Time"), but in this example the blame rests with the HTML <meta http-equiv="refresh" content="..."> tag, whose behavior in the archives I discovered quite by accident.

First, the meta refresh tag is a nasty bit of business that allows HTML to specify the HTTP headers you should have received.  This is occasionally useful (like loading a file from local disk), but more often than not it seems to create situations in which the HTML and the HTTP disagree about header values, leading to surprisingly complicated things like MIME type sniffing.  In general, having data formats specify protocol behavior is a bad idea (see the discussion about orthogonality in the W3C Web Architecture), but few can resist the temptation.  Specifically, http-equiv="refresh" makes things even worse, since the HTTP header "Refresh" never officially existed, and it was eventually dropped from the HTML specification as well.

However, it is a nice illustration of a common but non-standard HTML/fake-HTTP extension that nearly everyone supports.  Here's how it works, using www.cnn.com as an example:



This line:

<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>

tells the client to wait 30 minutes (1800 seconds) and then reload the page from the URL specified in the optional url= argument (if no URL is provided, the client uses the current page's URL).  CNN has used this "wait 30 minutes and reload" functionality for many years, and it is certainly desirable for a news site to cause the client to periodically reload its front page.  The problem comes when a page is archived, but the refresh capability is 1) not removed or 2) the URL argument is not (correctly) rewritten.
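To make the mechanics concrete, here is a small Python sketch (purely illustrative, and not taken from any archive's code) of how a replay tool might parse the content attribute into a delay and a target URL; the regex is simplified and assumes http-equiv appears before content:

import re

# Simplified pattern: matches content values like "1800" or "1800;url=http://...".
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']refresh["\']\s+content=["\'](\d+)(?:;\s*url=([^"\']+))?["\']',
    re.IGNORECASE)

def parse_meta_refresh(html, current_url):
    """Return (delay_in_seconds, target_url) if a meta refresh tag is present."""
    match = META_REFRESH.search(html)
    if not match:
        return None
    delay = int(match.group(1))
    # If no url= argument is given, the client reloads the current page.
    target = match.group(2) or current_url
    return delay, target

print(parse_meta_refresh(
    '<meta http-equiv="refresh" content="1800;url=http://www.cnn.com/?refresh=1"/>',
    'http://www.cnn.com/'))
# -> (1800, 'http://www.cnn.com/?refresh=1')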

Last week I loaded a memento of cnn.com from WebCitation, specifically http://webcitation.org/5lRYaE8eZ, which shows the page as it existed on 2009-11-21:


I hid that page, did some work, and then when I came back I noticed that it had reloaded to the page as of 2014-07-11, even though the URL and the archival banner at the top remained unchanged:


The problem is that WebCitation leaves the meta refresh tag as is, causing the page to reload from the live web after 30 minutes.  I had never noticed this behavior before, so I decided to check how some other archives handle it.

The Internet Archive rewrites the URL, so although the client still refreshes the page, it gets an archived page.  Checking:

http://web.archive.org/web/20091121211700/http://www.cnn.com/


we find:

<meta http-equiv="refresh" content="1800;url=/web/20091121211700/http://www.cnn.com/?refresh=1">


But since the IA doesn't know to canonicalize www.cnn.com/?refresh=1 to www.cnn.com, you actually get a different archived page:



Instead of ending up on 2009-11-21, we end up two days in the past at 2009-11-19:


To be fair, ignoring "?refresh=1" is not a standard canonicalization rule but could be added (standard caveats apply).  And although this is not quite a zombie, it is potentially unsettling since the original memento (2009-11-21) is silently exchanged for another memento (2009-11-19; future refreshes will stay on the 2009-11-19 version).  Presumably other Wayback-based archives behave similarly.  Checking the British Library I saw:

http://www.webarchive.org.uk/wayback/archive/20090914012158/http://www.cnn.com/

redirect to:

http://www.webarchive.org.uk/wayback/archive/20090402030800/http://www.cnn.com/?refresh=1

In this case the jump is more noticeable (five months: 2009-09-14 vs. 2009-04-02) since the BL's archives of cnn.com are more sparse.
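As an aside, the kind of site-specific canonicalization rule discussed above (treating www.cnn.com/?refresh=1 as equivalent to www.cnn.com) might look roughly like the following Python sketch; this is hypothetical and not the actual rule set used by the IA, the BL, or any other Wayback-based archive:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"refresh"}  # assumed, site-specific list of ignorable parameters

def canonicalize(uri):
    """Drop query parameters that do not change the page's content."""
    parts = urlsplit(uri)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

print(canonicalize("http://www.cnn.com/?refresh=1"))  # -> http://www.cnn.com/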

Perma.cc behaves similarly to the Internet Archive (i.e., rewriting but not canonicalizing), but presumably because it is a newer archive, it does not yet have a "?refresh=1" version of cnn.com archived.  It is possible that Perma.cc has a Wayback backend, but I'm not sure.  I had to push a 2014-07-11 version into Perma.cc (i.e., it did not already have cnn.com archived).  Checking:

http://perma.cc/89QJ-Y632?type=source


we see:

<meta http-equiv="refresh" content="1800;url=/warc/89QJ-Y632/http://www.cnn.com/?refresh=1"/>

And after 30 minutes it will refresh to a framed 404 because cnn.com/?refresh=1 is not archived:


As Perma.cc becomes more populated, the 404 behavior will likely disappear and be replaced with something like the Internet Archive and British Library examples.

Archive.today is the only archive that correctly handled this situation.  Loading:

https://archive.today/Zn6HS

produces:


A check of the HTML source reveals that they simply strip out the meta refresh tag altogether, so this memento will stay parked on 2013-06-27 no matter how long it stays in the client.
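A rough sketch of that "strip it out" approach (illustrative only; this is not Archive.today's actual implementation) could look like the following, which uses BeautifulSoup to drop any meta refresh tags before replay:

from bs4 import BeautifulSoup

def strip_meta_refresh(html):
    """Remove every <meta http-equiv="refresh"> tag from the captured HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(
            "meta", attrs={"http-equiv": lambda value: value and value.lower() == "refresh"}):
        tag.decompose()
    return str(soup)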

In summary:

  • WebCitation did not rewrite the URI and thus created a zombie
  • Internet Archive (and other Wayback archives) rewrites the URI, but because of site-specific canonicalization, it violates the user's expectations with a single time jump (the distance of which is dependent on the sparsity of the archive)
  • Perma.cc rewrites the URI, but in this case, because it is a new archive, produces a 404 instead of a time jump
  • Archive.today strips the meta refresh tag and avoids the behavior altogether

--Michael

2014-07-14: The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript

One very large part of digital preservation is the act of crawling pages on the live Web and saving them into a format for future generations to view. To accomplish this, web archivists use various crawlers, tools, and bits of software, often built for this specific purpose. Because these tools are purpose-built, users expect them to function much better than a general-purpose tool would.

As anyone who has looked up a complex web page in The Archive can tell you, the more complex the page, the less likely it is that all of the resources needed to replay the page will be captured. Even when these pages are preserved, the replay experience is frequently inconsistent with the page on the live web.

We have started building a preliminary corpus of tests to evaluate a handful of tools and web sites that were created specifically to save web pages from being lost in time.

In homage to the web browser evaluation websites by the Web Standards Project, we have created The Archival Acid Test as a first step in ensuring that these tools to which we supply URLs for preservation are doing their job to the extent we expect.

The Archival Acid Test evaluates features that modern browsers execute well but preservation tools have trouble handling. We have grouped these tests into three categories with various tests under each category:

The Basics

  • 1a - Local image, relative to the test
  • 1b - Local image, absolute URI
  • 1c - Remote image, absolute
  • 1d - Inline content, encoded image
  • 1e - Scheme-less resource
  • 1f - Recursively included CSS

JavaScript

  • 2a - Script, local
  • 2b - Script, remote
  • 2c - Script inline, DOM manipulation
  • 2d - Ajax image replacement of content that should be in archive
  • 2e - Ajax requests with content that should be included in the archive, test for false positive (e.g., same origin policy)
  • 2f - Code that manipulates DOM after a certain delay (test the synchronicity of the tools)
  • 2g - Code that loads content only after user interaction (tests for interaction-reliant loading of a resource)
  • 2h - Code that dynamically adds stylesheets

HTML5 Features

  • 3a - HTML5 Canvas Drawing
  • 3b - LocalStorage
  • 3c - External Webpage
  • 3d - Embedded Objects (HTML5 video)

For the first run of the Archival Acid Tests, we evaluated Internet Archive's Heritrix, GNU Wget (via its recent addition of WARC support), and our own WARCreate Google Chrome browser extension. Further, we ran the test on Archive.org's Save Page Now feature, Archive.today, Mummify.it (now defunct), Perma.cc, and WebCite. For each of these tools, we first attempted to preserve the Web Standards Project's Acid 3 Test (see Figure 1).

The results for this initial study (Figure 2) were accepted for publication (see the paper) at the Digital Libraries 2014 conference (joint JCDL and TPDL this year) and will be presented September 8th-14th in London, England.

The actual test we used is available at http://acid.matkelly.com for you to exercise with your tools/websites and the code that runs the site is available on GitHub.

— Mat Kelly (@machawk1)