
Monday, December 3, 2018

2018-12-03: Acidic Regression of WebSatchel

Mat Kelly reviews WebSatchel, a browser-based personal preservation tool.


Shawn Jones (@shawnmjones) recently made me aware of a personal tool to save copies of a Web page using a browser extension called "WebSatchel". The service is somewhat akin to the offerings of browser-based tools like Pocket (now bundled with Firefox after a 2017 acquisition) among many other tools. Many of these types of tools use a browser extension that allows the user to send a URI to a service that creates a server-side snapshot of the page. This URI delegation procedure aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value".

With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:

Note the thumbnail of the screenshot captured. The red square in the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity) represents content that requires a user to interact with the page before it loads and thus becomes accessible for preservation. Because I am evaluating only the tool's capture ability, the red square in the thumbnail is not necessarily indicative of what was captured. Repeating the procedure after first "surfacing" the red square on the live Web (i.e., interacting with the page before telling WebSatchel to grab it) produced a thumbnail in which all squares were blue. As expected, this suggests that WebSatchel generates the thumbnail using the browser's screenshot extension API at the time of URI submission rather than from a screenshot of its own capture. That the screenshot is limited to the viewport (rather than the whole page) also points to this.

Mis(re-)direction

I then clicked the "Open Save Page" button and was greeted with a slightly different result. This captured resided at https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html.

Dereferencing that URI with curl results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.

$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login
Content-Type: text/html

Note the lack of a scheme in the Location header. RFC2616 (HTTP/1.1) Section 14.30 requires the Location value to be an absolute URI (per RFC3986 Section 4.3). In an attempt to legitimize their hostname-leading redirect pattern, I also checked the more current RFC7231 Section 7.1.2, which revises the value of the Location response header to be a URI reference in the spirit of RFC3986. This updated HTTP/1.1 RFC allows for relative references, as was already done in practice prior to RFC7231. Per these standards, browsers interpret WebSatchel's scheme-less Location value as a relative reference, treating the hostname as a path segment and redirecting to https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login.

$ curl -I https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/websatchel.com/j/public/login
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Location: websatchel.com/j/public/login

...and repeated recursively until the browser reports "Too Many Redirects".
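To illustrate why browsers resolve the scheme-less Location value this way, below is a minimal Python sketch of RFC3986 relative reference resolution using the URIs from the curl session above. This is standard library behavior shown only to demonstrate the standards-mandated resolution; it is not anything WebSatchel-specific.

from urllib.parse import urljoin

# The capture URI that returned the 302 above.
base = "https://websatchel.com/j/pages/AQt5pBvSDkhPzpEt/Tl2kToC9fthiV1mM/index.html"

# The scheme-less Location value returned by WebSatchel.
location = "websatchel.com/j/public/login"

# With no scheme, the value is a relative reference (RFC3986 Section 5),
# so "websatchel.com" becomes a path segment under the capture's directory
# rather than a hostname.
print(urljoin(base, location))

Each subsequent 302 carries the same scheme-less Location value, so the resolution repeats against the new URI until the browser gives up with "Too Many Redirects".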

Interacting with the Capture

Despite the redirect issue, interacting with the capture shows that the red square persists. Even in the run where all squares were blue on the live Web, that square was red when viewing the capture. In addition, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing; these represent an iframe (with nothing CORS-related behind the scenes) and an embedded HTML5 object (the standard video element, nothing related to Custom Elements).

"Your" Captures

I had hoped to also evaluate archival leakage (aka Zombies), but the service does not seem to provide a way to save my captures to my own system; "your" archives are remotely (and solely) hosted by the service. While investigating a way to liberate my captures, I noticed that the default account is simply a one-month trial of a paid service with a relatively steep monthly pricing model. The "free" account is also limited to 1 GB per account and 3 pages per day, and loses access to their "page marker" feature, WebSatchel's system for a sort of text-highlighting annotation.

Interoperability?

WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera, but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breath for too long, I used my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture; as above, accessing the capture via curl would require circumventing the infinite redirect and manually completing an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.

—Mat (@machawk1)

Monday, July 2, 2018

2018-07-02: The Off-Topic Memento Toolkit

Inspired by AlNoamany's work in "Detecting off-topic pages within TimeMaps in Web archives", I am pleased to announce an alpha release of the Off-Topic Memento Toolkit (OTMT). The results of testing with this software will be presented at iPres 2018 and are now available as a preprint.

Web archive collections are created with a specific purpose in mind. A curator supplies seeds for the collection, and the archive captures multiple versions of these seeds in order to study the evolution of a web page over time. This is valuable for following the changes in an organization or the events in a news story. Unfortunately, sometimes these seeds go off-topic relative to the curator's intent. Because web archive crawling software has no way to know that a page has gone off-topic, these mementos are added to the collection anyway. Below I list a few examples of off-topic pages within Archive-It collections.

This memento from the Human Rights collection at Archive-It created by the Columbia University Libraries is off-topic. The page ceased to be available at some point and produced this "404 Page Not Found" response with a 200 HTTP status.

This memento from the Egypt Revolution and Politics collection at Archive-It created by the American University in Cairo is off-topic. The web site began having database problems.

It is important to note that the OTMT does not delete potentially off-topic mementos, but rather only flags them for curator review. Detecting such mementos allows us to exclude them from consideration or flag them for deletion by some downstream tool, which is important to our collection summarization and storytelling efforts. The OTMT detects these mementos using a variety of different similarity measures. One could also use the OTMT to detect and study off-topic mementos.

Installing the software


The OTMT requires Python 3.6. Once you have met that requirement, install OTMT by typing:

# pip install otmt

This installs the necessary libraries and provides the system with a new detect-off-topic command.

A simple run


To perform an off-topic run with the software on Archive-It collection 1068, type:

# detect-off-topic -i archiveit=1068 -tm cosine,bytecount -o myoutputfile.json

This will find all URI-Rs (seeds) related to Archive-It collection 1068, download their TimeMaps (URI-Ts), download the mementos within each TimeMap, process those mementos using the specified similarity measures (cosine and bytecount), and write the results in JSON format to a file named myoutputfile.json.

The JSON output looks like the following.
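Below is a sketch of that structure, reconstructed from the description that follows; the preprocessing flags, scores, and verdicts shown here are illustrative values rather than actual output.

{
    "http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/": {
        "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/": {
            "timemap measures": {
                "cosine": {
                    "stemmed": true,
                    "tokenized": true,
                    "removed boilerplate": true,
                    "comparison score": 0.92,
                    "topic status": "on-topic"
                },
                "bytecount": {
                    "stemmed": false,
                    "tokenized": false,
                    "removed boilerplate": false,
                    "comparison score": 0.97,
                    "topic status": "on-topic"
                }
            },
            "overall topic status": "on-topic"
        }
    }
}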



Each URI-T serves as a key containing all URI-Ms within that timemap. In this example the timemap at URI-T http://wayback.archive-it.org/1068/timemap/link/http://www.badil.org/ contains several mementos. For brevity, we are only showing results for the memento at http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/.

The key "timemap measures" contains all measures run against the memento. In this case I used the two measures "cosine" and "bytecount". Each measure entry indicates which preprocessing has been performed against that memento (e.g., stemmed, tokenized, and removed boilerplate). Under "comparison score" is that measure's score. Under "topic status" is a verdict on whether or not the memento is on or off-topic. Finally, the "overall topic status" indicates if any of the measures determined that the memento is off-topic.

The OTMT uses an input-measure-output architecture. This way the tool separates the concerns of input (e.g., how to process Archive-It collection 1068 for mementos) from measure (e.g., how to process these mementos using cosine and byte count similarity measures) and output (e.g., how to produce the output in JSON format and write it to the file myoutputfile.json). This architecture is extensible, providing interfaces that allow more input types, measures, and output types to be added in the future.

The -i (for specifying the input) and -o (for specifying the output) options are the only required options. The following sections detail the different command line options available to this tool.

Input and Output


The input type is supplied by the -i option. OTMT currently supports the following input types:
  • an Archive-It collection ID (keyword: archiveit)
  • one or more TimeMap URIs (URI-T) (keyword: timemap)
  • one or more WARCs (keyword: warc)
An output file is supplied by the -o option. Output types are specified by the -ot option. OTMT currently supports the following output types:
  • JSON as shown above (the default) (keyword: json)
  • a comma-separated file consisting of the same content found in the JSON file (keyword: csv)
To specify multiple WARCs, list them after the warc option like so:

# detect-off-topic -i warc=mycrawl1.warc.gz,mycrawl2.warc.gz -o myoutputfile.json

Likewise, for multiple TimeMaps, list them with the timemap argument and separate their URI-Ts with commas, like so:

# detect-off-topic -i timemap=https://archive.example.org/urit/http://example.org,https://archive.example.org/urit/http://example2.org -o myoutputfile.json

To use the comma-separated file format instead of json use the -ot option as follows:

# detect-off-topic -i archiveit=3936 -o myoutputfile.csv -ot csv

For better processing, we want to eliminate any interference from HTML and JavaScript associated with archive-specific branding. In the case of TimeMaps and Archive-It collections, raw mementos will be downloaded where available. While any TimeMap may be specified for processing, raw mementos are preferred as they do not contain the additional banner information and other augmentations supplied by many web archives. These augmentations may skew the off-topic results. Currently, only raw mementos from Archive-It are detected and processed. WARC files, of course, are "raw" by their nature, so removing web-archive augmentations like banners is not needed for WARC files.
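For context, Wayback-based archives such as Archive-It conventionally expose raw mementos by appending the id_ modifier to the 14-digit capture timestamp in the URI-M. The small sketch below shows that rewriting; the id_ convention is general Wayback behavior, not a description of OTMT's internals.

import re

def raw_memento_uri(urim):
    """Rewrite a Wayback-style URI-M to request the raw (unaugmented) memento."""
    # Insert the id_ modifier after the 14-digit capture timestamp.
    return re.sub(r"(/\d{14})/", r"\1id_/", urim, count=1)

print(raw_memento_uri(
    "http://wayback.archive-it.org/1068/20130307084848/http://www.badil.org/"))
# http://wayback.archive-it.org/1068/20130307084848id_/http://www.badil.org/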

Measures


OTMT supports the following measures with the -tm (for "timemap measure") option:

Each of these measures considers the first memento in a TimeMap to be on-topic and evaluates all other mementos in that TimeMap against that first memento.
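As an illustration of that comparison pattern (not OTMT's actual implementation), the sketch below scores each memento's extracted text against the first memento using a simple term-frequency cosine similarity and flags anything below a threshold as off-topic; the 0.10 threshold mirrors the cosine example later in this post and is otherwise arbitrary.

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between simple term-frequency vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def flag_off_topic(memento_texts, threshold=0.10):
    """The first memento is assumed on-topic; all others are compared against it."""
    first = memento_texts[0]
    return ["off-topic" if cosine_similarity(first, text) < threshold else "on-topic"
            for text in memento_texts]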

Measures and thresholds can be supplied on the command line, separated by commas. For example, to use Jaccard with a threshold of 0.15, separate the measure name and the threshold value, like so:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15

Multiple measures can also be used, separated by commas. For example, to use jaccard and cosine similarity, type the following:

# detect-off-topic -i archiveit=3936 -o outputfile -tm jaccard=0.15,cosine=0.10

The default thresholds for these measures have been derived from testing against a gold standard dataset of on-topic and off-topic mementos originally generated by AlNoamany. This dataset is now available at https://github.com/oduwsdl/offtopic-goldstandard-data/. We used this dataset as a standard and selected the threshold that produced the best F1 score for each measure. I will present the details of how we arrived at these thresholds at iPres 2018; our study is available as a preprint on arXiv.

Other options


Optionally, one may also change the working directory (-d) and the logging file (-l). By default, the software uses the directory /tmp/otmt-working for its work and logs to the screen with stdout.

The Future


I am still researching several features that will make it into future releases. I have separated the capabilities into library modules for use with future Python applications, but the code is currently volatile and I expect changes to come in the following months as new features are added and defects are fixed.

The software does not currently offer an algorithm utilizing the Web-based kernel function specified in AlNoamany's paper. This algorithm augments terms from the memento with terms from search engine result pages (SERPs), pioneered by Sahami and Heilman. Due to the sheer number of mementos to be evaluated by the OTMT and Google's policy on blocking requests to its SERPs, I will likely not implement this feature unless it is requested by the community.

I am also interested in the concept of "collection measures". I created the "timemap measures" key in the JSON output to differentiate one set of measure results from another eventual category of collection-wide measures that would test each memento against the topic of an entire collection. Preliminary work using the Jaccard Distance in this area was not fruitful, but I am considering other ideas.

The Off-Topic Memento Toolkit is available at https://github.com/oduwsdl/off-topic-memento-toolkit. Please give it a try and report any issues encountered and features desired. Although developed with an eye toward Archive-It collections, we hope to increase its suitability for all themed collections of archived web pages, such as personal collections created with webrecorder.io.



-- Shawn M. Jones

Tuesday, September 8, 2015

2015-09-08: Releasing an Open Source Python Project, the Services That Brought py-memento-client to Life

The LANL Library Prototyping Team recently received correspondence from a member of the Wikipedia team requesting Python code that could find the best URI-M for an archived web page based on the date of the page revision. Collaborating with Wikipedia, Harihar Shankar, Herbert Van de Sompel, Michael Nelson, and I were able to create the py-memento-client Python library to suit the needs of pywikibot.

Over the course of library development, Wikipedia suggested the use of two services, Travis CI and Pypi, that we had not used before.  We were very pleased with the results of those services and learned quite a bit from the experience.  We have been using GitHub for years, and also include it here as part of the development toolchain for this Python project.

We present three online services that solved the following problems for our Python library:
  1. Where do we store source code and documentation for the long term? - GitHub
  2. How do we ensure the project is well tested in an independent environment?  - Travis CI
  3. Where do we store the final installation package for others to use? - Pypi
We start first with storing the source code.

GitHub

As someone who is concerned about the longevity of the scholarly record, I cannot emphasize enough how important it is to check your code in somewhere safe.  GitHub provides a wide variety of tools, at no cost, that allow one to preserve and share their source code.

Git and GitHub are not the same thing.  Git is just a source control system.  GitHub is a dedicated web site providing additional tools and hosting for git repositories.

Here are some of the benefits of just using Git (without GitHub):
  1. Distributed authoring - many people can work separately on the same code and commit to the same place
  2. Branching is built in, allowing different people to work on features in isolation (like unfinished support for TimeMaps)
  3. Tagging can easily be done to annotate a commit for release
  4. Many IDEs and other development tools support Git out of the box
  5. Ease of changing remote git repositories if switching from one site to another is required
  6. Every git clone is actually a copy of the master branch of the repository and all of its history. Talk about LOCKSS!
That last one is important.  It means that all one needs to do is clone a git repository and they will then have a local archive of that repository branch, with complete history, at the time of cloning.  This is in contrast to other source control systems, such as Subversion, where the server is the only place storing the full history of the repository.  Using git avoids this single point of failure, allowing us to still have an archival copy, including history, in the case that our local git server or GitHub goes away.


Here are some of the benefits of using GitHub:
  1. Collaboration with others inside and outside of the project team, through the use of pull requests, code review, and an issue tracker
  2. Provides a GUI for centralizing and supporting the project
  3. Allows for easy editing of documentation using Markdown, and also provides a wiki, if needed
  4. The wiki can also be cloned as a Git repository for archiving!
  5. Integrates with a variety of web services, such as Travis CI
  6. Provides release tools that allow adding of release notes to tags while providing compiled downloads for users
  7. Provides a pretty-parsed view of the code where quick edits can be made on the site itself
  8. Allows access from multiple Internet-connected platforms (phone, tablet, laptop, etc.)
  9. And so much more that we have not yet explored....
We use GitHub for all of these reasons and we are just scratching the surface.  Now that we have our source code centralized, how do we independently build and test it?

Travis CI

Travis CI provides a continuous integration environment for code. In our case, we use it to determine the health of the existing codebase.

We use it to evaluate code for the following:
  1. Does it compile? - tests for syntax and linking errors
  2. Can it be packaged? - tests for build script and linking errors
  3. Does it pass automated tests? - tests that the last changes have not broken functionality
Continuous integration provides an independent test of the code. In many cases, developers get code to work on their magic laptop or their magic network and it works for no one else. Continuous Integration is an attempt to mitigate that issue.

Of course, far more can be done with continuous integration, like publishing released binaries, but with our time and budget, the above is all we have done thus far.

Travis CI provides a free continuous integration environment for code.  It easily integrates with GitHub.  In fact, if a user has a GitHub account, logging into Travis CI will produce a page listing all GitHub projects that they have access to. To enable a project for building, one just ticks the slider next to the desired project.

It then detects the next push to GitHub and builds the code based on a .travis.yml file, if present in the root of the Git repository.

The .travis.yml file has a relatively simple syntax whereby one specifies the language, language version, environment variables, pre-requisite requirements, and then build steps.

Our .travis.yml looks as follows:

language: python
cache: # caching is only available for customers who pay
    directories:
        - $HOME/.cache/pip
python:
    - "2.7"
    - "3.4"
env:
    - DEBUG_MEMENTO_CLIENT=1
install: 
    - "pip install requests"
    - "pip install pytest-xdist"
    - "pip install ."
script:
    - python setup.py test
    - python setup.py sdist bdist_wheel
branches:
    only:
        - master

The language section tells Travis CI which language is used by the project. Many languages are available, including Ruby and Java.

The cache section allows caching of installed library dependencies on the server between builds. Unfortunately, the cache section is only available for paid customers.

The python section lists for which versions of Python the project will be built.  Travis CI will attempt a parallel build in every version specified here.  The Wikimedia folks wanted our code to work with both Python 2.7 and 3.4.

The env section contains environment variables for the build.

The install section runs any commands necessary for installing additional dependencies prior to the build.  We use it in this example to install dependencies for testing.  In the current version this section is removed because we now handle dependencies directly via Python's setuptools, but it is provided here for completeness.

The script section is where the actual build sequence occurs.  This is where the steps are specified for building and testing the code.   In our case, Python needs no compilation, so we skip straight to our automated tests before doing a source and binary package to ensure that our setup.py is configured correctly.

Finally, the branches section is where one can indicate additional branches to build.  We only wanted to focus on master for now.

There is extensive documentation indicating what else one can do with .travis.yml.

Once changes have been pushed to GitHub, Travis CI detects the push and begins a build.  As seen below, there are two builds for py-memento-client: one for Python 2.7 and one for Python 3.4.



Clicking on one of these boxes allows one to watch the results of a build in real time, as shown below. Also present is a link allowing one to download the build log for later use.


All of the builds that have been performed are available for review.  Each entry contains information about the commit, including who performed it, as well as how long the build took, when it took place, how many tests passed, and, most importantly, whether it was successful.  Status is indicated by color: green for success, red for failure, and yellow for in progress.


Using Travis CI we were able to provide an independent sanity check on py-memento-client, detecting test data that was network-dependent and also eliminating platform-specific issues.  We developed py-memento-client on OSX, tested it at LANL on OSX and Red Hat Enterprise Linux, but Travis CI runs on Ubuntu Linux so we now have confidence that our code performs well in different environments.
Closing thought:  all of this verification only works as well as the automated tests, so focus on writing good tests.  :)

Pypi

Finally, we wanted to make it straightforward to install py-memento-client and all of its dependencies:

pip install memento_client
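Once installed, the library can be used to look up mementos; below is a minimal usage sketch based on the project's README at the time, so treat the class and method names as illustrative rather than a definitive API reference.

import datetime
from memento_client import MementoClient

# Ask a Memento TimeGate for information about the memento of a page
# closest to a given datetime (per the README; details may vary by version).
mc = MementoClient()
dt = datetime.datetime(2015, 9, 8)
info = mc.get_memento_info("http://www.cnn.com/", dt)
print(info)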

Getting there required Pypi, a site that globally hosts Python projects (mostly libraries).  Pypi not only provides storage for built code so that others can download it, but also requires that metadata be provided so that others can see what functionality the project provides.  Below is an image of the Pypi splash page for the py-memento-client.


Getting support for Pypi and producing the data for this splash page required that we use Python setuptools for our build. Our setup.py file, inspired by Jeff Knupp's "Open Sourcing a Python Project the Right Way", provides support for a complete build of the Python project.  Below we highlight the setup function that is the cornerstone of the whole build process.

setup(
    name="memento_client",
    version="0.5.1",
    url='https://github.com/mementoweb/py-memento-client',
    license='LICENSE.txt',
    author="Harihar Shankar, Shawn M. Jones, Herbert Van de Sompel",
    author_email="prototeam@googlegroups.com",
    install_requires=['requests>=2.7.0'],
    tests_require=['pytest-xdist', 'pytest'],
    cmdclass={
        'test': PyTest,
        'cleanall': BetterClean
        },
    download_url="https://github.com/mementoweb/py-memento-client",
    description='Official Python library for using the Memento Protocol',
    long_description="""
The memento_client library provides Memento support, as specified in RFC 7089 (http://tools.ietf.org/html/rfc7089)
For more information about Memento, see http://www.mementoweb.org/about/.
This library allows one to find information about archived web pages using the Memento protocol.  It is the goal of this library to make the Memento protocol as accessible as possible to Python developers.
""",
    packages=['memento_client'],
    keywords='memento http web archives',
    extras_require = {
        'testing': ['pytest'],
        "utils": ["lxml"]
    },
    classifiers=[

        'Intended Audience :: Developers',

        'License :: OSI Approved :: BSD License',

        'Operating System :: OS Independent',

        'Topic :: Internet :: WWW/HTTP',
        'Topic :: Scientific/Engineering',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Utilities',

        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3.4'
    ]
)

Start by creating this function call to setup, supplying all of these named arguments.  Those processed by Pypi are name, version, url, license, author, download_url, description, long_description, keywords, and classifiers.  The other arguments are used during the build to install dependencies and run tests.

The name and version arguments are used as the title for the Pypi page.  They are also used by those running pip to install the software.  Without these two items, pip does not know what it is installing.

The url argument is interpreted by Pypi as Home Page and will display on the web page using that parameter.

The license argument is used to specify how the library is licensed. Here we have a defect: we wanted users to refer to our LICENSE.txt file, but Pypi interprets the value literally, printing License: LICENSE.txt.  We may need to fix this.

The author argument maps to the Pypi Author field and will display literally as typed, so commas are used to separate authors.

The download_url argument maps to the Pypi Download URL field.

The description argument becomes the subheading of the Pypi splash page.

The long_description argument becomes the body text of the Pypi splash page.  All URIs become links, but attempts to put HTML into this field produced a splash page displaying raw HTML, so we left it as text until we require richer formatting.

The keywords argument maps to the Pypi Keywords field.

The classifiers argument maps to the Pypi Categories field.  When choosing classifiers for a project, use this registry.  This field is used to index the project on Pypi to make finding it easier for end users.

For more information on what goes into setup.py, check out "Packaging and Distributing Projects" and "The Python Package Index (PyPI)" on the Python.org site.

Once we had our setup.py configured appropriately, we had to register for an account with Pypi.  We then created a .pypirc file in the builder's home directory with the contents shown below.

[distutils]
index-servers =
    pypi

[pypi]
repository: https://pypi.python.org/pypi
username: hariharshankar
password: <password>

The username and password fields must both be present in this file. We encountered a defect while uploading the content whereby setuptools did not prompt for the password if it was not present, and the upload failed.

Once that is in place, use the existing setup.py to register the project from the project's source directory:

python setup.py register

Once that is done, the project shows up on the Pypi web site under that Pypi account. After that, publish it by typing:

python setup.py sdist upload

And now it will show up on Pypi for others to use.

Of course, one can also deploy code directly to Pypi using Travis CI, but we have not yet attempted this.

Conclusion


Open source development has evolved quite a bit over the last several years.  Among the first successful achievements were sites such as Freshmeat (now defunct) and SourceForge, which provided free repositories and publication sites for projects.  GitHub fulfills this role now, but developers and researchers need more complex tools.

Travis CI, coupled with good automated tests, allows independent builds and verification that software works correctly.  It ensures that a project not only compiles for users, but also passes functional tests in an independent environment.  As noted, one can even use it to deploy software directly.

Pypi is a Python-specific repository of Python libraries and other projects.  It is the backend of the pip tool commonly used by Python developers to install libraries.  Any serious Python development team should consider the use of Pypi for hosting and providing easy access to their code.

Using these three tools, we not only developed py-memento-client in a small amount of time, but also independently tested and published that library for others to enjoy.

--Shawn M. Jones
Graduate Research Assistant, Los Alamos National Laboratory
PhD Student, Old Dominion University

Friday, August 21, 2015

2015-08-20: ODU, L3S, Stanford, and Internet Archive Web Archiving Meeting



Two weeks ago (on Aug 3, 2015), I was glad to be invited to visit the Internet Archive in San Francisco in order to share our latest work with a set of web archiving pioneers from around the world.

The attendees were Jefferson Bailey and Vinay Goel from IA, Nicholas Taylor and Ahmed AlSum from Stanford, and Wolfgang Nejdl, Ivana Marenzi and Helge Holzmann from L3S.

First, we made quick introductions, each of us describing the purpose and the nature of our work to IA.

Then, Nejdl introduced the Alexandria project and demoed the ArchiveWeb project, which aims to develop tools and techniques to explore and analyze Web archives in a meaningful way. In the project, they develop tools that will allow users to visualize and collaboratively interact with Archive-It collections by adding new resources in the form of tags and comments. Furthermore, it contains a collaborative search and sharing platform.

I presented the off-topic detection work with a live demo for the tool, which can be downloaded and tested from https://github.com/yasmina85/OffTopic-Detection.


The off-topic tool aims to automatically detect when an archived page goes off-topic, meaning the page has changed over time and moved away from its initial scope. The tool suggests a list of off-topic pages based on a specific threshold that is input by the user. Based on evaluating the tool, we suggest values for the threshold in a research paper* that can be used to detect off-topic pages.

A site for one of the candidates in Egypt’s 2012 presidential election. Many of the captures of hamdeensabahy.com are not about the Egyptian Revolution. Later versions show an expired domain (as does the live Web version).

Examples for the usage of the tool:
--------

Example 1: Detecting off-topic pages in 1826 collection

python detect_off_topic.py -i 1826 -th 0.15
extracting seed list

http://agroecol.umd.edu/Research/index.cfm
http://casademaryland.org

50 URIs are extracted from collection https://archive-it.org/collections/1826
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://agroecol.umd.edu/Research/index.cfm
Downloading timemap using uri http://wayback.archive-it.org/1826/timemap/link/http://casademaryland.org

Downloading 4 mementos out of 306
Downloading 14 mementos out of 306

Detecting off-topic mementos using Cosine Similarity method

Similarity memento_uri
0.0 http://wayback.archive-it.org/1826/20131220205908/http://www.mncppc.org/commission_home.html/
0.0 http://wayback.archive-it.org/1826/20141118195815/http://www.mncppc.org/commission_home.html

Example 2: Detecting off-topic pages for http://hamdeensabahy.com/

python detect_off_topic.py -t https://wayback.archive-it.org/2358/timemap/link/http://hamdeensabahy.com/  -m wcount -th -0.85

Downloading 0 mementos out of 270
http://wayback.archive-it.org/2358/20140524131241/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130621131337/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140602131307/http://hamdeensabahy.com/
http://wayback.archive-it.org/2358/20140528131258/http://www.hamdeensabahy.com/
http://wayback.archive-it.org/2358/20130617131324/http://www.hamdeensabahy.com/


Downloading 4 mementos out of 270

Extracting text from the html

Detecting off-topic mementos using Word Count method

Similarity memento_uri
-0.979434447301 http://wayback.archive-it.org/2358/20121213102904/http://hamdeensabahy.com/

-0.966580976864 http://wayback.archive-it.org/2358/20130321080254/http://hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130526131402/http://www.hamdeensabahy.com/

-0.94087403599 http://wayback.archive-it.org/2358/20130527143614/http://www.hamdeensabahy.com/


Nicholas stressed the importance of the off-topic tool from a QA perspective, while the Internet Archive folks focused on the required computational resources and how the tool can be shared with Archive-It partners. The group discussed some user interface options for displaying the output of the tool.

After the demo, we discussed the importance of the tool, especially in crawling quality assurance practices.  While demoing the ArchiveWeb interface, some of the visualizations for pages from different collections showed off-topic pages.  We all agreed that it is important that those pages do not appear to users when they browse the collections.

It was amazing to spend time at IA and learn about the latest trends from other research groups. The discussion showed the high reputation of WS-DL research in the web archiving community around the world.

*Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, Detecting Off-Topic Pages in Web Archives, Proceedings of TPDL 2015, 2015.

----
Yasmin




Friday, October 3, 2014

2014-10-03: Integrating the Live and Archived Web Viewing Experience with Mink

The goal of the Memento project is to provide a tighter integration between the past and current web.    There are a number of clients now that provide this functionality, but they remain silent about the archived page until the user remembers to invoke them (e.g., by right-clicking on a link).

We have created another approach based on persistently reminding the user just how well archived (or not) are the pages they visit.  The Chrome extension Mink (short for Minkowski Space) queries all the public web archives (via the Memento aggregator) in the background and will display the number of mementos (that is, the number of captures of the web page) available at the bottom right of the page.  Selecting the indicator allows quick access to the mementos through a dropdown.  Once in the archives, returning to the live web is as simple as clicking the "Back to Live Web" button.
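For a rough idea of what such a lookup involves outside the browser, below is a sketch that fetches a link-format TimeMap from the public Memento aggregator and counts its mementos. The Time Travel endpoint and the regular expression are assumptions for illustration; Mink performs this query inside the extension rather than with code like this.

import re
import requests

def memento_count(uri_r):
    """Count mementos for a URI-R via a link-format TimeMap from the
    public Memento aggregator (endpoint assumed for illustration)."""
    timemap_uri = "http://timetravel.mementoweb.org/timemap/link/" + uri_r
    resp = requests.get(timemap_uri, timeout=30)
    if resp.status_code == 404:
        return 0  # the aggregator knows of no mementos for this page
    resp.raise_for_status()
    # Links whose rel value contains "memento" (including first/last memento)
    # each represent one capture of the page.
    return len(re.findall(r'rel="[^"]*memento[^"]*"', resp.text))

print(memento_count("http://www.cs.odu.edu/"))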

For the case where there are too many mementos to make navigating an extensive list usable (think CNN.com captures), we have provided a "Miller Columns" interface that allows hierarchical navigation and is common in many operating systems (though most don't know it by name).

For the opposite case where there are no mementos for a page, Mink provides a one-click interface to submit the page to Internet Archive or Archive.today for immediate preservation and provides just-as-quick access to the archived page.

Mink can be used concurrently with Memento for Chrome, which provides a different modality of letting the user specify desired Memento-Datetime as well as reading cues provided by the HTML pages themselves.  For those familiar with Memento terminology, Memento for Chrome operates on TimeGates and Mink operates on TimeMaps.  We also presented a poster about Mink at JCDL 2014 in London (proceedings, poster, video).

Mink is for Chrome, free, publicly available (go ahead and try it now!), and open source (so you know there's no funny business going on).

—Mat (@machawk1)

Sunday, February 24, 2013

2013-02-24: Personal Digital Archiving 2013


On February 21-22 Justin Brunelle (@justinfbrunelle) and I (@machawk1) traveled to College Park, Maryland for Personal Digital Archiving (PDA) 2013. Other members of the Web Science and Digital Libraries Research (WS-DL) Group at ODU had attended this conference in previous years (see 2012 Trip Report and 2011 Trip Report), always previously at the Internet Archive in San Francisco, and knew it would be informative and extremely relevant to both of our research efforts.
We had both been anticipating a few of the presentations, namely the keynotes by Sally Bedell Smith and George Sanger that Erin Engle (@erinengle) had promised on the Library of Congress digital preservation blog The Signal.
For the sake of preservation, I captured videos of many of the presentations, which I posted to the Internet Archive. Each available video is linked inline in this post, but for a more original experience, view the videos.
As our sole mission at WS-DL is not only to document conferences (ok, admittedly, documenting conferences is not in our mission), I presented a demo and poster titled "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving". The presentation described a software package I had created called Web Archiving Integration Layer (WAIL), currently available for download and open source. The tool packages together instances of Wayback, Heritrix, and other archiving tools to allow for one-click, user-instigated preservation or, as Dr. Michael L. Nelson (@phonedude_mln) put it, "Easiest Heritrix ever".



Day One


After registration, coffee, and the welcome by Bill Lefurgy (@blefurgy) and Trevor Muñoz (@trevormunoz), Sally Bedell Smith began the first keynote and talk of the conference. Sally documented her reluctance to move to modern writing programs, having been comfortable for the longest time with Xywrite and her 120+ words per minute (!) typing speed, which she had previously showcased to her colleagues. As her writing has been mostly biographical, she asked, "How do future biographers access personal media?" and stressed the need for unpublished content, including retracted content, to be available in the archives.
"Memory is deceptive", she said, further emphasizing the need to preserve drafts. She repeatedly illustrated the advantages of the paper medium over digital copies, stating that it was easier to get a 10,000-foot view when she could physically spread the pages out on a table, a perspective likely not common in the crowd of a conference bearing "Digital" in its name. "Personal correspondence is lost in digital media", she said, alluding to the context lost from interviews documented digitally. She would take these printouts, annotate them, and then make the digital changes. "Duplication is accepted/preferred", she said, and "sharing drafts is essential".
Jenny Shaw gave the first paper presentation with "Hardware and soft skills: surveying scientific personal papers in the digital age." She spoke of her work with the Human Genome Archive Project and the group's efforts in capturing scientific notes related to the Human Genome Project. She had led efforts in surveying the software used and the hardware needed. The primary focus was on the UK's efforts and on surveying born-digital material.

After Jenny, Sudheendra Hangal (@hangal) and Monica S. Lam (@MonicaSLam) presented "Engaging users with personal archives through gamification". Sudheendra spoke of the gamification of the e-mail process through a creative fill-in-the-blank construction of a user's e-mail content in the form of crosswords and word searches. With his software, Muse, this can all be done automatically, and a primary use case is Alzheimer's patients. The software instills a degree of personalization into familiar games.

After a short break, Noah Lenstra (@nlenstr2) presented "Connecting Local & Family History with Personal Digital Archiving: Findings from Studies in Four Midwestern Public Libraries". His main idea was the need to negotiate boundaries between personal and public archives and why personal archives should be converted to public archives.

After Noah, Heather Gendron presented her talk, "Passionate About History and the Making of History: in Situ Dialogues with Artists and their Assistants about Studio Archives". She had spoken to many artists and investigated their archiving processes and their feelings on whether the construction phase of their works should be archived. She wished to publish good-practice methods for artists so those with sub-par systems might learn of effective methods while still maintaining their workflow.

When Heather finished, the room was adjourned for lunch at the Banneker Room of the Stamp Student Union.

Following lunch, a series of lightning (10-minute) talks commenced. First up was Mike Ashenfelder with "The Library of Congress Personal Digital Archiving Videos". Along with the Library of Congress, his group "created short, 3-5 minute videos, with clear focused messages on a single topic related to digital preservation". His intention in creating these is to reach all audiences, namely non-technical audiences. A video his group was considering for the near future covers scanning, as there has been a lot of interest in that method as a preservation means.
A few videos they have already produced are "Why Digital Preservation is Important for Everyone" and "Why Digital Preservation is Important to You", as well as Butch Lazorchak's interviews with teenagers about digital preservation. "In his interviews they had some startling realizations", Mike said, "like a teenager who said, 'when she puts something on the internet, it's always available'."

MĂ©l Hogan (@mel_hogan) followed Mike with "Collect Yourself: Data Storage Centers as the Archive's Underbelly". MĂ©l emphasized in her presentation (slides) the environmental impact that preservation has, highlighting the Facebook data centers in both the energy required to run the servers but additionally, the energy required to cool them.

Nigel Lepianka (@trueXstory) followed MĂ©l with "Achievements as Personal Archives of Memory and Experience in Open World Video Games". His work has been in documenting how achievements can be used as a means of archiving how we navigate video games with work primarily done in games like World of Warcraft, which has been around long enough to represent an evolution of individuals and thus culture. As achievements are temporally organized, they show the order of experiences of the player.

Jan Emery next presented “Personal Artifacting”, a concept and practice at bringing dimensionality to personal archiving.

Following the lightning talks, Zach Vowell presented his paper, "The Many Faces of the Fat Man: A Case Study of a Multi-Faceted Personal Digital Archive." Zach spoke of the George Sanger (the Day Two PDA2013 keynote presenter) collection, part of the UT video game archive. The collection contained numerous obscure hardware media and software formats that he needed to recreate in their original form in order to preserve them. One of these was Sanger's recordings of digital data (namely video game audio) onto specialized VHS cassettes, which confused Zach when he tried to verify the data using a VCR, as he received only static.

After Zach, Smiljana Antonijevic (@Smiljana_A) and Ellysa Stern Cahoy presented "Scholarly workflow and personal digital archiving".
They presented interviews with academic faculty to investigate scholarly workflow. The project began in 2012 and will conclude in June 2013. They conducted a web-based survey of faculty and graduate students asking about their digital practices. The study spanned the sciences, humanities, and social sciences, examining how faculty manage their data. Their initial question was, generally, "How do faculty use their personal information collections?".
Next, Sudheendra Hangal (@hangal) returned for a second paper along with Sit Manovit, Peter Chan, and Monica S. Lam (@MonicaSLam) to present "Providing Access to Email Archives for Historical Research".
Jason Matthew Zalinger and Nathan G. Freier were the final presenters of the day with "Narrative Searching Through a Scholar’s Email Archive". Jay had been given a large corpus of e-mail from InfoVis mogul Ben Shneiderman (@benbendc), in attendance, for the sake of researching linguistic trends or indicators in an academic's professional communications. "Ben was always aware that his e-mail could become public", Jay said. He found the keyword "however" to be a transitional phrase that denoted much emotion in Ben's writings and illustrated examples outside of this corpus that confirmed his finding. Jay continued to "look for anger" in the corpus by finding other transitional phrases that had such an effect, relating most to moments in an academic's career where much emotion would be had (e.g., the acceptance or rejection of proposals). He finished with a quote from H. Porter Abbott: "Narrative is marked almost everywhere by its lack of closure. Commonly called suspense, this lack is one of the two things that above everything else give narrative its life." He expounded, "E-mail is suspenseful."
With the closing of the paper presentations, the crowd was instructed to head to the Maryland Institute for Technology in the Humanities (MITH) for the poster session. As above, I presented my poster titled "Making Enterprise-Level Archive Tools Accessible for Personal Web Archiving" along with eleven other posters by others from a wide range of fields.


Day Two

Day two started off with a keynote from George Sanger (a.k.a. "The Fat Man"). George reminisced about his past with music creation, mainly for video games, and the many other endeavors he had been involved in along the way. "Archiving is like playing an electrified guitar", he said, "what's the point?" "The motivation for an archive", he later said, "is that I want the stuff gone, but I want to keep it", thus confirming its necessity.
The talks for the day started with Megan Barnard and Gabriela Redwine's presentation of "Collaborating to Improve and Protect Born-Digital Acquisitions".

The next presentation was a quasi panel humorously titled, "All Your Bits Aren't Belong To Us: Opportunities and Challenges of Personally Revealing Information in Digital Collections".

Cal Lee was the first of the panel with his theme "It's Ethics All The Way Down". Cal spoke of the levels at which information resides that might be important to document or convey the way that people interact with systems. His levels consisted of:
  • Aggregation of objects
  • Object or package
  • In-application rendering
  • File through filesystem
  • File as "raw" bitstream
  • Sub-file data structure
  • Bitstream through I/O equipment
  • Raw signal stream through I/O equipment
  • Bitstream on physical medium

Naomi Nelson followed Cal in the panel. She spoke of lists they had received in their collections containing financial information and sensitive data. A further example was photos of a writer's workspace they had received that happened to have financial information in them. Beyond sensitive data, they discovered metadata for deleted files present in the collections, which the owners may not want exposed. She asked, "What do archives do with collections that include private info -- deleted files, drafts, cookies, geotags, etc."

Kam Woods (@kamwoods) spoke after Naomi as the third speaker of the panel with "Let There be Hope for Our Future". "The 'hope'", he described, "is that as digital materials get larger, we will have the right tools to protect donor information." His group has been building relatively simple software to prevent data corruption or manipulation through unintentional writes. A major part of his work is ensuring that you find everything there is to find on a disk when a donor submits content.

Matt Kirschenbaum (@mkirschenbaum) covered the tail end of the panel with his presentation, "Robot Historians". Matt spoke of scholarship in the context of archival donors, stating, "Data has the potential, and indeed the right, to make its call - that the stuff and matter of the cultural record is vested with agency in this negotiation. Scholarship is thus a vocation in the service of the inanimate, not just in the memory and shades of Shakespeare but their irreducible, physical remainder."

Melissa Rogers (@MelissaRogers17) gave the first paper to follow the panel with "Public Displays of Affection: Digital Zine Archives and the Labor of Love". She spoke of the culture encompassed within "zines" and the natural variance, degradation, and manipulation that zinesters' works possess. "Those of us interested in innovative forms of zine archiving must find a way around the limited 'to digitize or not-to-digitize' argument that seems to dominate many conversations about digital zine preservation", she said.

Seth Anderson (@AVPSeth) followed Melissa with "Protecting the Personal Narrative: An Assessment of Archival Practice's Place in Personal Digital Archiving". Seth stated, "Collection is a natural process. ... Collecting is a way of manifesting our own existence through the materials we accumulate around ourselves; we look to exist beyond our lifetimes. The materials we have represent ourselves."

After Seth, the crowd broke for lunch, only to return to a second round of lightning talks for the conference.

The first speaker of the lightning talks was Erin Engle (@erinengle) with, "We've Thought Globally, Now Let's Act Locally". She spoke about how the National Digital Information Infrastructure and Preservation Program (NDIIPP) has reached out to individuals for personal digital archiving advice. They also developed the Personal Digital Archiving Day Kit geared toward organization and institutions to share archiving guidance within their communities. The guidance supplied was non-technical in nature with suggestions like, "Identify where you have your individual files", "Organize your digital files", "Make copies, at least two, and store them in different locations".

Following Erin in the lightning talks was Philip von Stade with "Memories Lost & Found: How Digital Memories Can Help Those With Alzheimers and Their Caregivers". Philip spoke of memory loss and the use of preserved photos for keeping aging people sharp via social engagement. He is in the process of creating an iPad app that allows photos to be organized and audio annotations to be added, with the use case being the review of these photos and annotations to help the aging stave off Alzheimer's through the treatment of social engagement.

Sarah Kim was the third and last speaker of the conference's lightning talks with "The virtual presence of others and the presentation of self in personal digital archives".

Following the lightning talks, Evan Carroll (@evancarroll) presented "Law and Society: Current Advances in the Digital Afterlife". Evan discussed the idea of having a "digital executor", an option for estate planning that involves a more tech-savvy person handling digital assets. "There is a certain advantage to being dead and gone", he said of worrying about information being exposed by those insufficiently capable of properly dealing with one's digital assets after death.
After a final break, the conference closed with Leslie Swift and Lindsay Zarwell's presentation "Projections of Life: Prewar Jewish Life on Film".

Bill and Trevor closed up the conference as they began it, by asking the crowd for suggestions and comments on the conference.
Overall, Justin and I found the conference very informative, and it enlightened us to some of the concerns of those in the humanities.
— Mat (@machawk1)

Monday, September 3, 2012

2012-08-31: Benchmarking LANL's SiteStory

On August 17th, 2012, Los Alamos National Laboratory's Herbert Van de Sompel announced the release of the anticipated transactional web archiver called SiteStory.



The ODU WS-DL research group (in conjunction with The MITRE Corporation) performed a series of studies to measure the effect of the SiteStory on web server performance. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources.

A sneak-peek at how SiteStory affects server performance is provided below. Please see the technical report for a full description of these results. But first, let's compare the archival behaviors of transactional and conventional Web archives.

Crawler and user visits generate archived copies of a changing page.


A visual representation of a typical page change and user access scenario is depicted in the above figure. This scenario assumes an arbitrary page, called P, that changes at inconsistent intervals. The timeline shows page P changing at points C1, C2, C3, C4, and C5 at times t2, t6, t8, t10, and t13, respectively. A user makes a request for P at points O1, O2, and O3 at times t3, t5, and t11, respectively. A Web crawler (that captures representations for storage in a Web archive) visits P at points V1 and V2 at times t4 and t9, respectively. Since O1 occurs after change C1, an archived copy of C1 is made by the transactional archive (TA). When O2 is made, P has not changed since O1 and therefore an archived copy is not made, since one already exists. The Web crawler's visit V1 captures C1 and makes a copy in the Web archive. In servicing V1, an unoptimized TA will store another copy of C1 at t4, while an optimized TA could detect that no change has occurred and not store another copy of C1.

Change C2 occurs at time t6, and C3 occurs at time t8. There was no access to P between t6 and t8, which means C2 is lost -- an archived copy exists in neither the TA nor the Web crawler's archive. However, the argument can be made that if no entity observed the change, should it be archived? Change C3 occurs and is archived during the crawler's visit V2, and the TA will also archive C3. After C4, a user accessed P at O3 creating an archived copy of C4 in the TA. In the scenario depicted in Figure 1, the TA will have changes C1, C3, C4, and a conventional archive will only have C1, C3. Change C2 was never served to any client (human or crawler) and is thus not archived by either system. Change C5 will be captured by the TA when P is accessed next.

The example in the above figure demonstrates a transactional archive's ability to capture a single version of each user-observed version of a page, but does not capture versions unseen by users.
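To make the bookkeeping concrete, below is a small sketch that replays the timeline described above and reports which changes end up in an (optimized) transactional archive versus a conventional crawler-based archive. The event times come from the figure description; the logic is only the rule stated above, not SiteStory code.

# Events from the scenario above: page changes, user accesses, and crawler visits.
changes = {"C1": 2, "C2": 6, "C3": 8, "C4": 10, "C5": 13}  # change -> time
accesses = [3, 5, 11]   # user requests O1, O2, O3
crawls = [4, 9]         # crawler visits V1, V2

def version_served_at(t):
    """The version of P that would be served at time t."""
    served = [c for c, ct in changes.items() if ct <= t]
    return max(served, key=lambda c: changes[c]) if served else None

# A transactional archive stores whichever version is served to any client
# (user or crawler); an optimized TA deduplicates repeats, hence the set.
ta_archive = {version_served_at(t) for t in accesses + crawls}

# A conventional archive only stores what the crawler happens to see.
crawler_archive = {version_served_at(t) for t in crawls}

print(sorted(ta_archive))       # ['C1', 'C3', 'C4']
print(sorted(crawler_archive))  # ['C1', 'C3']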

Los Alamos National Laboratory has developed SiteStory, an open-source transactional Web archive. First, mod_sitestory is installed on the Apache server that contains the content to be archived. When the Apache server builds the response for the requesting client, mod_sitestory sends a copy of the response to the SiteStory Web archive, which is deployed as a separate entity. This Web archive then provides Memento-based access to the content served by the Apache server with mod_sitestory installed, and the SiteStory Web archive is discoverable from the Apache web server using standard Memento conventions.

Sending a copy of the HTTP response to the archive is an additional task for the Apache Web server, and this task must not come at too great a performance penalty to the Web server. The goal of this study is to quantify the additional load mod_sitestory places on the Apache Web server to be archived.

ApacheBench (ab) was used to gather the throughput statistics of a server when SiteStory was actively archiving content and compare those statistics to those of the same server when SiteStory was not running. The below figures from the technical report show that SiteStory does not hinder a server's ability to provide content to users in a timely manner.

Total run time for the ab test with 10,000 connections and 1 concurrency.

Total run time for the ab test with 10,000 connections and 100 concurrency.

Total run time for the ab test with 216,000 connections and 1 concurrency.

Total run time for the ab test with 216,000 connections and 100 concurrency.

To test the effect of sites with large numbers of embedded resources, 100 HTML pages were constructed, with Page 0 containing 0 embedded images, Page 1 containing 1 embedded image, ..., and Page n containing n embedded images. As expected, larger resources take longer to serve to a requesting user, and SiteStory's performance impact grows with resource size, as depicted in the below figures.






As depicted in these figures, SiteStory does not significantly hinder a server, and increases the ability to actively archive content served from a server. More details on these graphs can be found in the technical report, which has been posted to arXiv.org:

Justin F. Brunelle, Michael L. Nelson, Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool, Technical Report 1209.1811v1, 2012.

--Justin F. Brunelle