Sunday, January 7, 2018

2018-01-07: Review of WS-DL's 2017

The Web Science and Digital Libraries Research Group had a steady 2017, with one MS student graduated, one research grant awarded ($75k), 10 publications, and 15 trips to conferences, workshops, hackathons, internships, etc.  In the previous four years (2013--2016) we graduated five PhD and three MS students, so the focus for this year was "recruiting", and we did pick up seven new students: three PhD and four MS.  We had so many new and prospective students that Dr. Weigle and I created a new CS 891 web archiving seminar to introduce them to web archiving and graduate school basics.

We had 10 publications in 2017:
  • Mohamed Aturban published a tech report about the difficulties in simply computing fixity information about archived web pages (spoiler alert: it's a lot harder than you might think; blog post).  
  • Corren McCoy published a tech report about ranking universities by their "engagement" with Twitter.  
  • Yasmin AlNoamany, now a post-doc at UC Berkeley,  published two papers based on her dissertation about storytelling: a tech report about the different kinds of stories that are possible for summarizing archival collections, and a paper at Web Science 2017 about how our automatically created stories are indistinguishable from those created by experts.
  • Lulwah Alkwai published an extended version of her JCDL 2015 best student paper in ACM TOIS about the archival rate of web pages in Arabic, English, Danish, and Korean languages (spoiler alert: English (72%), Arabic (53%), Danish (35%), and Korean (32%)).
  • The rest of our publications came from JCDL 2017:
    •  Alexander Nwala published a paper about his 2016 summer internship at Harvard and the Local Memory Project, which allows for building archival collections from material published by local news outlets. 
    • Justin Brunelle, now a lead researcher at Mitre, published the last paper derived from his dissertation.  Spoiler alert: if you use headless crawling to activate all the javascript, embedded media, iframes, etc., be prepared for your crawl time to slow and your storage to balloon.
    • John Berlin had a poster about the WAIL project, which allows easily running Heritrix and the Wayback Machine on your laptop (those who have tried know how hard this was before WAIL!)
    • Sawood Alam had a proof-of-concept short paper about "ServiceWorker", a new javascript library that allows for rewriting URIs in web pages and could have significant impact on how we transform web pages in archives.  I had to unexpectedly present this paper since, thanks to a flight cancellation the day before, John and Sawood were in a taxi headed to the venue during the scheduled presentation time!
    • Mat Kelly had both a poster (and separate, lengthy tech report) about how difficult it is to simply count how many archived versions of a web page an archive has (spoiler alert: it has to do with deduping, scheme transition of http-->https, status code conflation, etc.).  This won best poster at JCDL 2017!
We were fortunate to be able to travel to about 15 different workshops, conferences, and hackathons.

WS-DL did not host any external visitors this year, but we were active with the colloquium series in the department and the broader university community:
In the popular press, we had two main areas of coverage:
  • RJI ran three separate articles about Shawn, John, and Mat participating in the 2016 "Dodging the Memory Hole" meeting. 
  • On a less auspicious note, it turns out that Sawood and I had inadvertently uncovered the Optionsbleed bug three years ago, but failed to recognize it as an attack. This fact was covered in several articles, sometimes with the spin of us withholding or otherwise being cavalier with the information.
We've continued to update existing software and datasets and release new ones via our GitHub account. Given the evolving nature of software and data, it can sometimes be difficult to pin down a specific release date, but this year we made a number of significant releases and updates.
For funding, we were fortunate to continue our string of eight consecutive years with new funding.  The NEH and IMLS awarded us a $75k, 18-month grant, "Visualizing Webpage Changes Over Time", for which Dr. Weigle is the PI and I'm the Co-PI.  This is an area we've recognized as important for some time, and we're excited to have a separate project dedicated to visualizing archived web pages. 

Another point you can probably infer from the discussion above, but which I'd like to make explicit, is that we're especially happy to be able to continue to work with so many of our alumni.  The nature of certain jobs inevitably takes some people outside of the WS-DL orbit, but as you can see above, in 2017 we were fortunate to continue to work closely with Martin (2011), now at LANL; Yasmin (2016), now at Berkeley; and Justin (2016), now at Mitre.  

WS-DL annual reviews are also available for 2016, 2015, 2014, and 2013.  Finally, I'd like to thank all those who at various conferences and meetings have complimented our blog, students, and WS-DL in general.  We really appreciate the feedback, some of which we include below.

--Michael

Wednesday, November 22, 2017

2017-11-22: Deploying the Memento-Damage Service

Many web services, such as archive.is, Archive-It, the Internet Archive, and the UK Web Archive, provide archived web pages, or mementos, for us to use. Web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. This raises the question of how to objectively measure the damage of a memento in a way that matches user (human) perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes how the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, while a second page could be missing a single embedded video (e.g., a YouTube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has been proven capable of measuring the damage, it is not user-ready. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype was built with the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color values (RGB) of each pixel in the screenshot to a file, which the prototype then loads for further analysis. Dumping and reading the pixel colors of the screenshot take a significant amount of time, and the process is repeated once for each stylesheet the web page has. Therefore, if a web page has 5 stylesheets, the analysis is repeated 5 times even though it uses the same screenshot as the basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided; users must discover it themselves from the errors that appear during execution. Furthermore, we needed to 'package' and deploy this prototype into a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service: the website, the REST API, a Docker image, and a Python library, each described below.
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code  = 301, 302, or 307).
  2. Providing some insights and information.
    The user will not only get the final damage value but will also be informed about the details of the process that happened during crawling and calculation, as well as the components that make up the final damage value. If an error happened, the error info will also be provided.  
  3. Dealing with overlapping resources and iframes. 

Measuring the Damage

Crawling the Resources
When a user inputs a URI-M into the Memento-Damage service, the tool checks the content-type of the URI-M and crawls all of its resources. The properties of the resources, such as the size and position of each image, are written to a log file. Figure 2 summarizes the crawling process conducted by Memento-Damage. Along with this process, a screenshot of the website is also created.

Figure 2. The crawling process in Memento-Damage
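As a rough illustration of that first content-type check, here is a minimal sketch assuming a simple requests-based probe; the actual tool drives PhantomJS to crawl the resources and write the log files, which is not reproduced here.

import requests

def is_html_memento(urim):
    # Return True if the URI-M responds with an HTML content type;
    # only HTML responses proceed to resource crawling and screenshotting.
    resp = requests.head(urim, allow_redirects=True, timeout=30)
    return 'text/html' in resp.headers.get('Content-Type', '').lower()

# Example URI-M (hypothetical):
print(is_html_memento('https://web.archive.org/web/20170101000000/http://example.com/'))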

Calculating the Damage
After crawling the resources, Memento-Damage starts calculating the damage by reading the log files that were previously generated (Figure 3). Memento-Damage first reads the network log and examines the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it chases down the final URI by following the URI in the Location header, as depicted in Figure 4. Each resource is then processed according to its type (image, css, javascript, text, iframe) to obtain its actual and potential damage values. Finally, the total damage is computed using the formula:
\begin{equation}\label{eq:tot_dmg}T_D = \frac{A_D}{P_D}\end{equation}
where $T_D$ is the total damage, $A_D$ is the actual damage, and $P_D$ is the potential damage.

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where the subscripts denote image (i), css (c), javascript (j), multimedia (m), text (t), and iframe (f), respectively. 
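As a rough numerical illustration of this ratio (hypothetical values, not output from an actual Memento-Damage run), the total damage can be computed by summing the per-type actual and potential damage values:

# Hypothetical per-type damage values: i=image, c=css, j=javascript,
# m=multimedia, t=text, f=iframe
actual = {'i': 0.10, 'c': 0.05, 'j': 0.00, 'm': 0.40, 't': 0.02, 'f': 0.00}
potential = {'i': 0.25, 'c': 0.10, 'j': 0.05, 'm': 0.40, 't': 0.10, 'f': 0.05}

def total_damage(actual, potential):
    # T_D = sum of actual damage / sum of potential damage, ranging from 0 to 1
    p = sum(potential.values())
    return sum(actual.values()) / p if p > 0 else 0.0

print(total_damage(actual, potential))  # approximately 0.6 for these hypothetical numbers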
For image analysis, we use Pillow, a python image library that has better and faster performance than PerlMagick. Pillow can read pixels in an image without dumping it into a file to speed up the analysis process. Furthermore, we modify the algorithm so that we only need to run the analysis script once for all stylesheets.
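To illustrate why this avoids the dump-and-reload step, a minimal sketch of reading pixels directly in memory with Pillow might look like the following (a simplified illustration, not the tool's actual analysis code):

from PIL import Image

def blank_pixel_ratio(screenshot_path, blank_rgb=(255, 255, 255)):
    # Fraction of pixels matching a "blank" color, read directly from memory
    # without writing the per-pixel color values to an intermediate file.
    img = Image.open(screenshot_path).convert('RGB')
    pixels = list(img.getdata())
    blank = sum(1 for p in pixels if p == blank_rgb)
    return float(blank) / len(pixels)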
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources, such as the one illustrated in Figure 5, need to be treated differently to prevent the damage value from being double counted. To solve this problem, we use the concept of a rectangle (xmin, ymin, xmax, ymax). We treat the overlapping resources as rectangles and calculate the size of the intersection area. The size of one of the overlapping resources is reduced by the intersection size, while the other keeps its full size. Figure 6 and Listing 1 illustrate the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
from collections import namedtuple

Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax')

def rectangle_intersection_area(a, b):
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0):
        return dx * dy
    return 0  # the rectangles do not overlap
Listing 1. Measuring image rectangle in Memento-Damage

Dealing with Iframes

Dealing with iframes is quite tricky and requires some customization. First, by default, the crawling process cannot access content inside an iframe using native javascript or jQuery selectors due to cross-domain restrictions. The problem becomes more complicated when the iframe is nested inside other iframe(s). Therefore, we need a way to switch from the main frame to the iframe. To handle this, we utilize the API provided by PhantomJS that facilitates switching from one frame to another. Second, the positions of resources inside an iframe are reported relative to that particular iframe, not to the main frame, which could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation that takes into account the position of its parent frame(s).
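The position correction for nested iframes can be sketched as summing the offsets of every enclosing frame. The structure below is a hypothetical simplification, assuming each resource's logged position comes with the chain of its parent iframe offsets (outermost first):

def absolute_position(resource_xy, parent_frame_offsets):
    # Convert an iframe-relative (x, y) into main-frame coordinates by adding
    # the offsets of all enclosing iframes, outermost first.
    x, y = resource_xy
    for fx, fy in parent_frame_offsets:
        x += fx
        y += fy
    return (x, y)

# A resource at (10, 20) inside an iframe at (100, 300), which is itself
# inside an iframe at (0, 50) of the main frame:
print(absolute_position((10, 20), [(0, 50), (100, 300)]))  # (110, 370)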

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website is the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating the damage of a large number of URIs. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API is the part of the web service that facilitates damage calculation from any HTTP client (e.g., a web browser, curl, etc.) and returns output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using the REST API, a user can create a script and calculate damage for a small number of URIs (e.g., 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage Docker image uses Ubuntu-LXDE for the desktop environment. A fixed desktop environment is used to avoid inconsistencies in the damage value for the same URI when run on machines with different operating systems. We found that PhantomJS, the headless browser used for generating the screenshot, renders the web page in accordance with the machine's desktop environment. Hence, the same URI could produce a slightly different screenshot, and thus a different damage value, when run on different machines (Figure 8). 


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these  steps: 
  1. Pull the docker image:
    docker pull erikaris/memento-damage
    
  2. Run the docker image:
    docker run -it -p <host port>:80 --name <container name> erikaris/memento-damage
    
    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
    
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    
    If the user wants to work inside the Docker container's terminal, use the following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring Memento-Damage using the various options (Figure 9), which can be listed by typing:
    docker exec -it memdamage memento-damage --help
    or, if the user is already inside the Docker container, by simply typing:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage are available at http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is relatively similar to that of the Docker version. It is aimed at people who already have all the dependencies (PhantomJS 2.xx and Python 2.7) installed on their machines and do not want to bother installing Docker. The latest library version can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run Memento-Damage as a web service: memento-damage-server
  3. Run Memento-Damage via the CLI: memento-damage <URI>
  4. Explore the available options using --help:
    memento-damage-server --help    (for the web service)
    or
    memento-damage --help    (for the CLI)

Testing on a Large Number of URIs

To support our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The test used the Docker version of Memento-Damage running on a machine with an Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz and 4 GiB of memory. The testing result is shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

Of the 108,511 input URI-Ms tested, 80,580 URI-Ms were successfully processed, while the remaining 27,931 URI-Ms failed to process. The failures on those 27,931 URI-Ms were caused by the Internet Archive's limits on concurrent access. On average, one URI-M needed 32.5 seconds of processing time. This is roughly 110 times faster than the prototype version, which took an average of 1 hour to process one URI-M and, in some cases, took almost 2 hours. 
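The batch run could be driven by a loop along the following lines. This is a hypothetical sketch using the Docker CLI described above (the container name "memdamage" and input file "urims.txt" are assumptions); the actual testing script was not published with this post.

import json
import subprocess

def damage_for(urim):
    # Run the CLI inside the running container and parse its JSON output mode
    cmd = ['docker', 'exec', 'memdamage', 'memento-damage', '--mode', 'json', urim]
    return json.loads(subprocess.check_output(cmd))

results, failures = {}, []
with open('urims.txt') as f:           # one URI-M per line (hypothetical input file)
    for urim in (line.strip() for line in f if line.strip()):
        try:
            results[urim] = damage_for(urim)
        except (subprocess.CalledProcessError, ValueError):
            failures.append(urim)

print('processed: %d, failed: %d' % (len(results), len(failures)))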

From those successfully processed URI-Ms, we created some visualizations to help us better understand the results, as seen below. The first graph (Figure 11) shows the average number of missing embedded resources per memento per year according to the damage value (Dm) and the missing resources (Mm). The most interesting case appeared in 2011, where the Dm value was significantly higher than Mm. This means that although the URI-Ms from 2011 lost only 4% of their resources on average, those losses caused about four times more damage than the number shown by Mm. On the other hand, in 2008, 2010, 2013, and 2017, the Dm value is lower than Mm, which implies that the missing resources in those years were less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the total number of resources in each URI-M and its missing resources.  The x-axis represents each URI-M, sorted in descending order by the number of resources, while the y-axis represents the number of resources owned by each URI-M. This graph shows that almost every URI-M lost at least one of its embedded resources. 

Summary

In this research, we have improved the calculation method for measuring the damage to a memento (URI-M) based on the earlier prototype. The improvements include reducing calculation time, fixing various bugs, handling redirection, and handling new types of resources. We developed Memento-Damage into a comprehensive tool that can show the detail of every resource that contributes to the damage. Furthermore, it provides several ways of using the tool, such as a Python library and a Docker image. The testing results show that Memento-Damage works faster than the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements in Memento-Damage compared to the initial prototype. 

No | Subject | Prototype | Memento-Damage
1 | Programming language | JavaScript + Perl | JavaScript + Python
2 | Interface | CLI | CLI, website, REST API
3 | Distribution | Source code | Source code, Python library, Docker image
4 | Output | Plain text | Plain text, JSON
5 | Processing time | Very slow | Fast
6 | Includes iframes | N/A | Available
7 | Redirection handling | N/A | Available
8 | Resolves overlap | N/A | Available
9 | Blacklisted URIs | Only 1 blacklisted URI, added manually | Several new blacklisted URIs, identified based on a certain pattern
10 | Batch execution | Not supported | Supported
11 | DOM selector capability | Only supports simple selection queries | Supports complex selection queries
12 | Input filtering | N/A | Only processes input in HTML format
Table 1. Improvement on Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs a lot of improvement to increase its functionality and provide a better user experience. We really hope, and strongly encourage, everyone, especially people who work in the web archiving field, to try this tool and give us feedback. Please read the FAQ and HELP before starting to use Memento-Damage. Help us improve by telling us what we can do better and by posting any bugs, errors, issues, or difficulties that you find in this tool on our GitHub.

- Erika Siregar -

Thursday, November 16, 2017

2017-11-16: Paper Summary for Routing Memento Requests Using Binary Classifiers

While researching my dissertation topic, I re-encountered the paper, "Routing Memento Requests Using Binary Classifiers" by Bornand, Balakireva, and Van de Sompel from JCDL 2016 (arXiv:1606.09136v1). The high-level gist of this paper is that by using two corpora of URI-Rs consisting of requests to their Memento aggregator (one for training, the other for training evaluation), the authors were able to significantly mitigate wasted requests to archives that contained no mementos for a requested URI-R.

For each of the 17 Web archives included in the experiment, a classifier was generated, with the exception of the Internet Archive, on the assumption that a positive result would always be returned from it. The classifiers informed the decision, given a URI-R, of whether the respective Web archive should be queried.

Optimization of this sort has been performed before. For example, AlSum et al. from TPDL 2013 (trip report, IJDL 2014, and arXiv) created profiles for 12 Web archives based on TLD and showed that it is possible to obtain a complete TimeMap for 84% of the URI-Rs requested using only the top 3 archives. In two separate papers from TPDL 2015 (trip report) then TPDL 2016 (trip report), Alam et al. (2015, 2016) described making routing decisions when you have the archive's CDX information and when you have to use the archive's query interface to expose its holdings (respectively) to optimize queries.

The training data set was based on the LANL Memento Aggregator cache from September 2015, containing over 1.2 million URI-Rs. The authors used Receiver Operating Characteristic (ROC) curves comparing the rate of false positives (the URI-R should not have been included but was) to the rate of true positives (the URI-R was rightfully included in the classification). When requesting a prediction from the classifier once trained, a pair of these rates is chosen corresponding to the most acceptable compromise for the application.

A sample ROC curve (from the paper) to visualize memento requests to an archive.

Classification of this sort required feature selection, for which the authors used the character length of the URI-R and the count of special characters, as well as the Public Suffix List (PSL) domain (cf. AlSum et al.'s use of TLD as a primary feature). The rationale for choosing PSL over TLD was that most archives cover the same popular TLDs. An additional token feature was produced by parsing the URI-R, removing delimiters to form tokens, and transforming the tokens to lowercase.

The authors used four different methods to evaluate the ranking of the features being explored for the classifiers: frequency over the training set; the sum of the differences between feature frequencies for a URI-R and the aforementioned method; entropy, as defined by Hastie et al. (2009); and the Gini impurity (see Breiman et al. 1984). Each metric was evaluated to determine how it affected the prediction by training a binary classifier using the logistic regression algorithm.
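A heavily simplified sketch of the idea, using scikit-learn, might look like the following. This is my own illustration with toy data, not the authors' code, feature set, or training corpus; it combines token features with URI length and special-character counts and fits a logistic regression classifier for a single hypothetical archive.

import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# Toy training data: URI-Rs and whether a hypothetical archive held a memento
uris = ['http://example.com/news/2014/story.html',
        'http://example.org/?p=123&cat=7',
        'http://another.net/blog/post',
        'http://example.com/index.html']
labels = [1, 0, 1, 1]

def length_and_special_chars(uri_list):
    # Two numeric features per URI-R: total length and count of non-alphanumerics
    return np.array([[len(u), len(re.findall(r'[^A-Za-z0-9]', u))] for u in uri_list])

features = make_union(
    CountVectorizer(token_pattern=r'[a-z0-9]+', lowercase=True),    # lowercased tokens
    FunctionTransformer(length_and_special_chars, validate=False),  # numeric features
)
classifier = make_pipeline(features, LogisticRegression())
classifier.fit(uris, labels)

# Estimated probability that this archive should be queried for an unseen URI-R
print(classifier.predict_proba(['http://example.com/news/2015/other.html'])[0][1])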

The paper includes applications of the above plots for each of the four feature selection strategies. Following the training, they evaluated the performance of each algorithm for classification using the corresponding sets of selected features, with a preference toward low computational load and memory usage. The algorithms evaluated were logistic regression (as used before), Multinomial Bayes, Random Forest, and SVM. Aside from Random Forest, the algorithms had similar prediction runtimes, so the other three were evaluated further.

A classifier was trained for each permutation of the three remaining algorithms and each archive. To determine the true positive threshold, they brought in a second data set consisting of 100,000 unrelated URI-Rs from the Internet Archive's log files from early 2012. Of the three algorithms, they found that logistic regression performed best for 10 archives and Multinomial Bayes for 6 others (per above, IA was excluded).

The authors then evaluated the trained classifiers using yet another dataset of URI-Rs from 200,000 randomly selected requests (cleaned to just over 187,000) from oldweb.today. Given that this data set was based on inter-archive requests, it was more representative of an aggregator's requests than the IA dataset. They computed recall, computational cost, and response time using a simulated method to avoid the need for thousands of live requests. These results confirmed that the currently used heuristic of querying all archives has the best recall (results are comprehensive) but that response time could be drastically reduced using a classifier. With a reduction in recall of 0.153, fewer than 4 requests instead of 17 would reduce the response time from just over 3.7 seconds to about 2.2 seconds. Additional details of the optimization obtained through evaluation of the true positive rate can be found in the paper.
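In aggregator terms, the routing decision this enables could be as simple as the hedged sketch below: always query the Internet Archive and otherwise query only the archives whose classifier is confident enough (the archive names, classifier objects, and threshold are illustrative assumptions, not the paper's implementation):

def archives_to_query(urir, classifiers, threshold=0.5):
    # classifiers: dict mapping an archive name to a trained binary classifier
    selected = ['Internet Archive']    # always queried, per the paper's setup
    for archive, clf in classifiers.items():
        if clf.predict_proba([urir])[0][1] >= threshold:
            selected.append(archive)
    return selected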

Take Away

I found this paper to be an interesting and informative read on a very niche topic that is hyper-relevant to my dissertation topic. I foresee a potential chance to optimize archival queries from other Memento aggregators like MemGator and look forward to further studies in this realm on both optimization and caching.

Mat (@machawk1)

Nicolas J. Bornand, Lyudmila Balakireva, and Herbert Van de Sompel, "Routing Memento Requests Using Binary Classifiers," In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2016), pp. 63-72. (Also at arXiv:1606.09136)

Monday, June 26, 2017

2017-06-26: IIPC Web Archiving Conference (WAC) Trip Report

Mat Kelly reports on the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017 in London, England.                            

In the latter part of Web Archiving Week (#waweek2017) from Wednesday to Friday, Sawood and I attended the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017, held jointly with the RESAW Conference at the Senate House and British Library Knowledge Center in London. Each of the three days had multiple tracks. Reported here are the presentations I attended.

Prior to the keynote, Jane Winters (@jfwinters) of University of London and Nicholas Taylor (@nullhandle) welcomed the crowd with admiration toward the Senate House venue. Leah Lievrouw (@Leah53) from UCLA then began the keynote. In her talk, she walked through the evolution of the Internet as a medium to access information prior to and since the Web.

With reservation toward the "Web 3.0" term, Leah described a new era in the shift from documents to conversations, to big data. With a focus toward the conference, Leah described the social science and cultural break down as it has applied to each Web era.

After the keynote, two concurrent presentation tracks proceeded. I attended the track where Jefferson Bailey (@jefferson_bail) presented "Advancing access and interface for research use of web archives". First citing an updated metric of the Internet Archive's holdings (see Ian's tweet below), Jefferson provided an update on some contemporary holdings and collections by IA, inclusive of some details on his GifCities project (introduced with IA's 20th anniversary; see our celebration), which provides searchable access to the archive's holdings of the animated GIFs that once resided on Geocities.com.

In addition to this, Jefferson also highlighted the beta features of the Wayback Machine, inclusive of anchor text-based search algorithm, MIME-type breakdown, and much more. He also described some other available APIs inclusive of one built on top of WAT files, a metadata format derived from WARC.

Through recent efforts by IA for their anniversary, they also had put together a collection of military PowerPoint slide decks.

Following Jefferson, Niels Brügger (@NielsBr) led a panel consisting of a subset of authors from the first issue of his journal, "Internet Histories". Marc Weber stated that the journal had been in the works for a while. When he initially told people he was looking at the history of the Web in the 1990s, people were puzzled. He went on to compare the Internet to being in its Victorian era, having evolved from 170 years of the telephone and 60 years of being connected through the medium. Of the vast history of the Internet we have preserved relatively little. He finished by noting that we need to treat history and preservation as something that should be done quickly, as we cannot go back later to find the materials if they are not preserved.

Steve Jones of University of Illinois at Chicago spoke second about the Programmed Logic for Automatic Teaching Operations (PLATO) system. There were two key interests, he said, in developing for PLATO -- multiplayer games and communication. The original PLATO lab was in a large room and because of laziness, they could not be bothered to walk to each other's desks, so developed the "Talk" system to communicate and save messages so the same message would not have to be communicated twice. PLATO was not designed for lay users but for professionals, he said, but was also used by university and high school students. "You saw changes between developers and community values," he said, "seeing development of affordances in the context of the discourse of the developers that archived a set of discussions." Access to the PLATO system is still available.

Jane Winters presented third on the panel stating that there is a lot of archival content that has seen little research engagement. This may be due to continuing work on digitizing traditional texts but it is hard to engage with the history of the 21st century without engaging with the Web. The absence of metadata is another issue. "Our histories are almost inherently online", she said, "but they only gain any real permanence through preservation in Web archives. That's why humanists and historians really need to engage with them."

The tracks then joined together for lunch and split back into separate sessions, where I attended the presentation, "A temporal exploration of the composition of the UK Government Web Archive", which examined the evolution of the UK National Archives (@uknatarchives). This was followed by a presentation by Caroline Nyvang (@caobilbao) of the Royal Danish Library that examined current web referencing practices. Her group proposed the persistent web identifier (PWID) format for referencing Web archives, which was eerily similar to the URI semantics often used in another protocol.

Andrew (Andy) Jackson (@anjacks0n) then took the stage to discuss the UK Web Archive's (@UKWebArchive) catalog and the challenges they have faced while considering the inclusion of Web archive material. He detailed a process, represented by a hierarchical diagram, describing the sorts of transformations required in going from the data to reports and indexes about the data. In doing so, he juxtaposed his process with other archival workflows that would be performed in a conventional library catalog architecture.

Following Andy, Nicola Bingham (@NicolaJBingham) discussed curating collections at the UK Web Archive, which has been archiving since 2013, and the challenges in determining the boundaries and scope of what should be collected. She encouraged researchers to engage with them to shape their collections. Their current holdings consist of about 400 terabytes with 11 to 12 billion records, growing by 60 to 70 terabytes and 3 billion records per year. Their primary mission is to collect UK web sites under UK TLDs (like .uk, .scot, .cymru, etc.). Domains are currently capped at 512 megabytes of preserved content, and even then other technical limitations exist in capture (proprietary formats, plugins, robots.txt, etc.).

When Nicola finished, there was a short break. Following that, I traveled upstairs in the Senate House to the "Data, process, and results" workshop, led by Emily Maemura (@emilymaemura). She first described three different research projects, each of whose researchers were present, and asked attendees to break out into groups to discuss the various facets of each project in detail with the researcher. I opted to discuss Federico Nanni's (@f_nanni) work with him and a group of other attendees. His work consisted of analyzing and resolving issues in the preservation of the web site of the University of Bologna. The site specifies a robots.txt exclusion, which makes the captures inaccessible to the public, but through his investigation and efforts he was able to change their local policy to allow further examination of the captures.

With the completion of the workshop, everyone still in attendance gathered in the Chancellor's Hall of the Senate House as Ian Milligan (@ianmilligan1) and Matthew Weber (@docmattweber) gave a wrap-up of the Archives Unleashed 4.0 Datathon, which had occurred prior to the conference on Monday and Tuesday. Part of the wrap-up was time given to the three top-ranked projects as determined by judges from the British Library. The group I was part of at the Datathon, "Team Intersection", was one of the three, so Jess Ogden (@jessogden) gave a summary presentation. More information on our intersection analysis across multiple data sets can be found on our GitHub.io page. A blog post detailing our report of the Datathon will be posted here in the coming days.

Following the AU 4.0 wrap-up, the audience moved to the British Library Knowledge Center for a panel titled, "Web Archives: truth, lies and politics in the 21st century". I was unable to attend this, opting for further refinement of the two presentations I was to give on the second day of IIPC WAC 2017 (see below).

Day Two

The second day of the conference was split into three concurrent tracks -- two at the Senate House and a third at the British Library Knowledge Center. Given I was slated to give two presentations at the latter (and the venues were about 0.8 miles apart), I opted to attend the sessions at the BL.

Nicholas Taylor opened the session with the scope of the presentations for the day and introduced the first three presenters. First on the bill was Andy Jackson with "Digging documents out of the web archives." He initially compared this talk to the one he had given the day prior (see above) relating to the workflows in cataloging items. In the second day's talk, he discussed the process of the Digital ePrints team and the inefficiencies of its manual process for ingesting new content. Based on this process, his team set up a new harvester that watches targets, extracts the documents and machine-readable metadata from the targets, and submits them to the catalog. Still, issues remained, one being what to identify as the "publication" for e-prints relative to the landing page, assets, and what is actually cataloged. He discussed the need for further experimentation with a variety of workflows to optimize the outcome for quality, to ensure the results are discoverable and accessible, and to keep the process mostly automated.

Ian Milligan and Nick Ruest (@ruebot) followed Andy with their presentation on making their Canadian web archival data sets easier to use. "We want web archives to be used on page 150 in some book," they said, reinforcing that they want the archives to inform the insights instead of the subject necessarily being the archives themselves. They also discussed their extraction and processing workflow: acquiring the data from the Internet Archive, then using Warcbase and other command-line tools to make the data contained within the archives more accessible. Nick said that since last year, when they presented webarchives.ca, they have indexed 10 terabytes representing over 200 million Solr docs. Ian also discussed derivative datasets they had produced, inclusive of domain and URI counts, full-text, and graphs. Making the derivative data sets accessible and usable by researchers is a first step toward their work being used on page 150.

Greg Wiedeman (@GregWiedeman) presented third in the technical session, first giving context on his work at the University at Albany (@ualbany), where they are required to preserve state records with no dedicated web archives staff. Some records have paper equivalents, like archived copies of their Undergraduate Bulletins, while digital versions might consist of Microsoft Word documents corresponding to the paper copies. They are using DACS to describe archives, so they questioned whether they should use it for Web archives. On a technical level, he runs a Python script that looks at their collection of CDXs and schedules a crawl, which is displayed in their catalog as it completes. "Users need to understand where web archives come from," he says, "and need provenance to frame their research questions, which will add weight to their research."

A short break commenced, followed by Jefferson Bailey presenting, "Who, what when, where, why, WARC: new tools at the Internet Archive". Initially apologizing for repetition of his prior days presentation, Jefferson went into some technical details of statistics IA has generated, APIs they have to offer, and new interfaces with media queries of a variety of sorts. They also have begun to use Simhash to identify dissimilarity between related documents.

I (Mat Kelly, @machawk1) presented next with "Archive What I See Now – Personal Web Archiving with WARCs". In this presentation I described the advancements we had made to WARCreate, WAIL, and Mink with support from the National Endowment for the Humanities, which we have reported on in a few prior blog posts. This presentation served as a wrap-up of new modes added to WARCreate, the evolution of WAIL (See Lipstick or Ham then Electric WAILs and Ham), and integration of Mink (#mink #mink #mink) with local Web archives. Slides below for your viewing pleasure.

Lozana Rossenova (@LozanaRossenova) and Ilya Kreymer (@IlyaKreymer) talked next about Webrecorder, and namely about remote browsers. Showing a live example of viewing a web archive with a contemporary browser, they demonstrated that technologies that are no longer supported are not replayed as expected, often not being visible at all. Their work allows a user to replicate the original experience of the browser of the day and to use the technologies as they were (e.g., Flash or Java applet rendering) for a more accurate portrayal of how the page existed at the time. This is particularly important for replicating art work that depends on these technologies to display. Ilya also described their Web Archiving Manifest (WAM) format, which allows a collection of Web archives to be used in replaying Web pages, with fetches performed at the time of replay. This patching technique allows for a more accurate replication of the page as it existed at a point in time.

After Lozana and Ilya finished, the session broke for lunch then reconvened with Fernando Melo (@Fernando___Melo) describing their work at the publicly available Portuguese Web Archive. He showed their work building an image search of their archive using an API to describe Charlie Hebdo-related captures. His co-presenter João Nobre went into further details of the image search API, including the ability to parameterize the search by query string, timestamp, first-capture time, and whether it was "safe". Discussion from the audience afterward asked of the pair what their basis was of a "safe" image.

Nicholas Taylor spoke about recent work with LOCKSS and WASAPI and the re-architecting of the former to open the potential for further integration with other Web archiving technologies and tools. They recently built a service for bibliographic extraction of metadata for Web harvest and file transfer content, which can then be mapped to the DOM tree. They also performed further work on an audit and repair protocol to validate the integrity of distributed copies.

Jefferson again presented to discuss IMLS-funded APIs they are developing to test transfers to their partners using WASAPI. His group ran surveys showing that 15-20% of Archive-It users download their WARCs to be stored locally. Their WASAPI Data Transfer API returns a JSON object derived from the set of WARCs transferred, inclusive of fields like pagination, count, requested URI, etc. Other fields representing an Archive-It ID, checksums, and collection information are also present. Naomi Dushay (@ndushay) then showed a video overview of their deployment procedure.

After another short break, Jack Cushman & Ilya Kreymer tag-teamed to present "Thinking like a hacker: Security Issues in Web Capture and Playback". Through a mock dialog, they discussed issues in securing Web archives and a suite of approaches challenging users to compromise a dummy archive. Ilya and Jack also iterated through various security problems that might arise in serving, storing, and accessing Web archives, inclusive of stealing cookies, frame hijacking to display a false record, banner spoofing, etc.

Following Ilya and Jack, I (@machawk1, again) and David Dias (@daviddias) presented, "A Collaborative, Secure, and Private InterPlanetary WayBack Web Archiving System using IPFS". This presentation served as follow-on work from the InterPlanetary Wayback (ipwb) project Sawood (@ibnesayeed) had originally built at the Archives Unleashed 1.0 then presented at JCDL 2016, WADL 2016, and TPDL 2016. This work, in collaboration with David of Protocol Labs, who created the InterPlanetary File System (IPFS), was to display some advancements in both IPWB and IPFS. David began with an overview of IPFS: the problem it is trying to solve, its system of content addressing, and its mechanism for facilitating object permanence. I discussed, as with previous presentations, IPWB's integration of web archive (WARC) files with IPFS using an indexing and replay system that utilizes the CDXJ format. One item in David's recent work is bringing IPFS to the browser with a JavaScript port that interfaces with IPFS without the need for a running local IPFS daemon. I had recently introduced encryption and decryption of WARC content to IPWB, allowing for further permanence of archival Web data that may be sensitive in nature. To close the session, we performed a live demo of IPWB, replicating WARC data from another machine onto the presentation machine.

Following our presentation, Andy Jackson asked for feedback on the sessions and what IIPC can do to support the enthusiasm for open source and collaborative approaches. Discussions commenced among the attendees about how to optimize funding for events, with Jefferson Bailey reiterating that travel eats up a large amount of the cost for such events. Further discussions were had about why the events were not recorded and how to remodel the Hackathon events along the lines of other organizations' efforts, like Mozilla's Global Sprints, the events organized by the NodeJS community, and sponsoring developers for the Google Summer of Code. The audience then had further discussions on how to follow up and communicate once the day was over, inclusive of the IIPC Slack Channel and the IIPC GitHub organization. With that, the second day concluded.

Day 3

By Friday, with my presentations for the trip complete, I now had but one obligation for the conference and the week (other than write my dissertation, of course): to write the blog post you are reading. This was performed while preparing for JCDL 2017 in Toronto the following week (that I attended by proxy, post coming soon). I missed out on the morning sessions, unfortunately, but joined in to catch the end of João Gomes' (@jgomespt) presentation on Arquivo.pt, also presented the prior day. I was saddened to know that I had missed Martin Klein's (@mart1nkle1n) "Uniform Access to Raw Mementos" detailing his, Los Alamos', and ODU's recent collaborative work in extending Memento to support access to unmodified content, among other characteristics that cause a "Raw Memento" to be transformed. WS-DL's own Shawn Jones (@shawnmjones) has blogged about this on numerous occasions, see Mementos in the Raw and Take Two.

The first full session I was able to attend was Abbie Grotke's (@agrotke) presentation, "Oh my, how the archive has grown...", which detailed the progress and size that the Library of Congress's Web archive has experienced with minimal growth in staff despite the substantial increase in the size of their holdings. While captivated, I learned via the conference Twitter stream that Martin's third presentation of the day coincided with Abbie's. Sorry, Martin.

I did manage to switch rooms to see Nicholas Taylor discuss using Web archives in legal cases. He stated that in some cases social media used by courts may only exist in Web archives and that courts now accept archival web captures as evidence. The first instance of using IA's Wayback Machine in court was in 2004, and its use has since been contested many times without avail. The Internet Archive provides affidavit guidance that suggests asking the court to ensure usage of the archive will consider captures as valid evidence. Nicholas alluded to FRE 201, which allows facts to be used as evidence, the basis on which the archive has been used. He also cited various cases where expert testimony on Web archives was used (Khoday v. Symantec Corp., et al.), a defamation case where the IA disclaimer led to the archive being dismissed as evidence (Judy Stabile v. Paul Smith Limited et al.), and others. Nicholas also cited WS-DL's own Scott Ainsworth's (@Galsondor) work on temporal coherence and how a composite memento may not have existed as displayed.

Following Nicholas, Anastasia Aizman and Matt Phillips (@this_phillips) presented "Instruments for Web archive comparison in Perma.cc". In their work with Harvard's Library Innovation Lab (with which WS-DL's Alex Nwala was recently a Summer fellow), the Perma team's goal is to allow users to cite things on the Web, create WARCs of those things, and then organize the captures. Their initial work with the Supreme Court corpus from 1996 to the present found that 70% of the references had rotted. Anastasia asked, "How do we know when a web site has changed, and how do we know which changes are important?"

They used a variety of ways to determine significant change, inclusive of MinHash (via calculating Jaccard coefficients), Hamming distance (via SimHash), and sequence matching against a baseline. As a sample corpus, they took over 2,000 Washington Post articles consisting of over 12,000 resources, examined the SimHash values, and found big gaps. For MinHash, the distances appeared much closer. In their implementation, they show this to the user on Perma via a banner that provides an option to highlight changes between sets of documents.
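For readers unfamiliar with the Jaccard coefficient that MinHash approximates, a toy calculation over token sets looks like this (an illustrative sketch, not Perma.cc's implementation):

def jaccard(a, b):
    # Jaccard coefficient of two token sets: |A intersect B| / |A union B|
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b)) if (a or b) else 1.0

old = 'breaking news story about the court ruling'.split()
new = 'updated news story about the court decision'.split()
print(jaccard(old, new))  # ~0.56 for these token sets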

There was a brief break, after which I attended a session where Peter Webster (@pj_webster) and Chris Fryer (@C_Fryer) discussed their work with the UK Parliamentary Archives. Their recent work consists of capturing the official social media feeds of members of parliament, which is critical as it captures their relationship with the public. They sought to examine the patterns of use and access by the members and to determine the level of understanding users have of their archive. "Users are hard to find and engage," they said, citing that users were largely ignorant of what web archives are. In a second study, they found that users wanted a mechanism for discovery that mapped to an internal view of how parliament functions. Their studies found many things that web archive users do not want, but a takeaway is that they uncovered some issues in their assumptions and the study raised the profile of the Parliamentary Web Archives among their colleagues.

Emily Maemura and Nicholas Worby presented next with their discussion on origin studies as they relate to web archives, provenance, and trust. They examined decisions made in creating collections in Archive-It by the University of Toronto Libraries, namely the collections involving the Canadian political parties, the Toronto 2015 Pan Am Games, and their Global Summitry Archive. From these they determined the defining traits of each: long running, a one-time event, and a collaboratively created archive, respectively. For the candidates' sites, they also noticed the implementation of robots.txt exclusions in a supposed attempt to prevent the sites from being archived.

Alexis Antracoli and Jackie Dooley (@minniedw) presented next about their OCLC Research Library Partnership web archive working group. Their examination determined that discoverability was the primary issue for users. One example was Princeton's use of Archive-It not being documented anywhere. Through their study they established use cases for libraries, archives, and researchers, and created a data dictionary of characteristics of archives consisting of 14 data elements like Access/rights, Creator, and Description, with many fields having a direct mapping to Dublin Core.

With a short break, the final session then began. I attended the session where Jane Winters (@jfwinters) spoke about increasing the visibility of web archives, asking first, "Who is the audience for Web archives?" and then enumerating researchers in the arts, humanities, and social sciences. She then described various examples in the press relating to web archives, inclusive of a Computer Weekly report on the Conservatives erasing official records of speeches from the IA and Dr. Anat Ben-David's work on getting the .yu TLD restored in the IA.

Cynthia Joyce then discussed her work studying Hurricane Katrina's unsearchable archive. Because New Orleans was not a tech-savvy place at the time, and it was pre-Twitter and Facebook was young, the personal record is not what it would be were the events to happen today. As a citizen researcher, she attempted to identify themes and stories that would have been missed in mainstream media. She said, "On Archive-It, you can find the Katrina collection ranging from resistance to gratitude." She collected the information only 8-9 years later, and many of the writers never expected their work to be preserved.

For the final presentation of the conference, Colin Post (@werrthe) discussed net-based art and how to go about making such works objects of art history. Colin used Alexei Shulgin's "Homework" as an example that uses pop-ups and self-conscious elements that add to the challenge of preservation. In Natalie Bookchin's course, Alexei Shulgin encouraged artists to turn in homework for grading, also doing so himself. His assignment is dominated by pop-ups, something we view in a different light today. "Archives do not capture the performative aspect of the piece," Colin said, noting that oldweb.today provides interesting insights into how the page was captured over time, with multiple captures being combined. "When I view the whole piece, it is emulated and artificial; it is disintegrated and inauthentic."

Synopsis

The trip proved very valuable to my research. Not documented in this post is the time between sessions where I was able to speak to some of the presenters about their work as it related to my own, and even to some who were not presenting, finding intersections in our respective research.

Mat (@machawk1)