Showing posts with label WAC. Show all posts

Wednesday, July 19, 2017

2017-07-19: Archives Unleashed 4.0: Web Archive Datathon Trip Report


They: Hey Sawood, nice to see you again.
Me: Hi, I am glad to see you too.
They: Did you attend all hackathons, I mean datathons?
Me: Yes, I attended all of the four Archives Unleashed events so far.
They: How did you like it?
Me: Well, there is a reason why I attended all of them, despite being a seemingly busy PhD researcher.
They: So, what is your research about?
Me: I am trying to profile various web archives to build a high-level understanding of their holdings, primarily, for the sake of efficiently routing Memento aggregation requests, but there can be many more use cases of such profiles... [and the conversation continues...]


On day zero of Archives Unleashed 4.0 in London, conversations among many familiar and unfamiliar faces started with travel and lodging related questions, but soon evolved into mass storage challenges, scaling issues, quality and coverage of web archives, long-term maintenance of archival tools, documentation and discovery of libraries, and exchange of research ideas. Ian and Matt were looking fresh and welcoming in the reception of #HackArchives as always. This was all familiar; it is how previous AU events started too, and it yielded great networking among the web archiving community members.


Previously, the Web Science and Digital Libraries Research Group (WSDL) has been well-represented at AU events, but visa issues and competing events meant that only Mat and I were able to attend.


The next day, on Monday, June 12, 2017, the main event started at the British Library in the morning with the usual registration process, welcome kit, and AU-branded, 3D-printed looking, strange red rubber balls (that no one had any idea what to do with). Dr. Matthew Weber and Dr. Ian Milligan began with the opening remarks and described the scope of the event and the available datasets and other resources.


Next was the current efforts session, in which Ian, Jefferson, Tom, and Andy were supposed to talk about Warcbase, Internet Archive APIs, National Archives Datasets, and UK Web Archive respectively. Since Jefferson could not make it to the event on time, Ian had to morph into Jefferson for the corresponding talk about IA APIs. All of these talks were insightful and offered a lot to learn from.

Possibly the most interesting aspect of AU events is the phenomenon of group formation. People and idea stickers flock around the room and naturally cluster in smaller groups with similar interests to come up with a more precise research question and datasets to use. This time, they formed a total of eight different groups with a diverse set of research questions and scopes.


After the lunch break, teams settled at their tables and started worrying about task refinement, computing resources, data acquisition, and action plans. One of the most difficult issues at AU events is dataset acquisition. Advertised datasets are often not in an easy-to-get condition. Additionally, these datasets are often too large to be copied over to the respective computing instances in a feasible amount of time. Some preprocessing and sampling can be helpful. Complex (and often unknown) authentication barriers should also be removed from the data acquisition process. On one hand, it is part of the learning process to acquire and understand the data and learn about other tools to create derivative data, but on the other hand, I have consistently noticed that this process is difficult and limits the opportunity for actual data analysis.

Another very useful aspect of AU events is the opportunity for people to share their current projects and efforts in the field of web archiving using short lightning talks. In the past we have taken advantage of it to introduce various WSDL efforts such as MemGator, IPWB, CarbonDate, WhatDidItLookLike, and ICanHazMemento. Following the tradition, there were again a handful of lightning talks lined up for both days.
After the first round of five lightning talks, teams went back to their hacking tasks, mostly trying to acquire datasets, understand them, and adjust their ambitious plans to something more feasible within the short time limit. Then everyone left for dinner while discussing ideas and the scope of their work with their team members. The dinner was really good, but it did not stop people from exchanging world-shaking ideas.


The next morning many teams were talking about how much data they had processed overnight and what to do next. The next couple of hours were critical for every team to come up with something that provided some answers to their proposed research questions. After another session of lightning talks, teams continued to work on their projects, but now they started thinking about the reporting aspect and visualizations of their findings as more and more results became apparent. The efforts continued during and after the short lunch break. One could see people multi-tasking to get everything done before the final presentations that were only a coffee break away, but some people still had the courage to put everything aside for a while and go for a walk outside. Not every team was working on data analysis, but the overall experience was still generalizable. Finally, the time had arrived for brief project presentations and sharing the findings of the "Samudra Manthan" in front of three esteemed judges from the British Library.
  • Team Portuguese Archive presented their outcome of archived image classification using TensorFlow. As a testbed they used maps to distinguish contemporary maps from historic maps.
  • Team Intersect (of which I was a member) presented the archival coverage of the Occupy Wall Street movement in various collections and social media along with the overlap among various datasets. They found less than 1% overlap among different datasets, which suggests that more collectors mean better coverage. They also found that two-thirds of the outlinks from these collections were not archived.
  • The Olympians presented gender distribution in Olympic committees and found strong male bias.
  • Team Shipman Report analyzed text in the Shipman Report and found it deadly and dark.
  • Team Links analyzed WARC files to find the trend in distribution of relative/absolute paths and absolute URLs in anchor elements along with HTML element distribution around anchors over time.
  • Team Robots analyzed different types of robots.txt files in web archives with the intent of finding the impact on archival captures if the robots.txt was honored. They found that the impact would not be huge.
  • Team Curated built a prototype of an upcoming Rhizome tool for better curation and annotation. They illustrated some wireframe prototypes of various components and workflow.
  • Team WARCs peeked inside WARC files for traces of politics and elections in the US.
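
The kind of overlap figure Team Intersect reported can be illustrated with a small sketch. This is not the team's actual code; the collection names and URLs below are made up, and the Jaccard coefficient is only one plausible way to quantify overlap between sets of URLs.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(collections):
    """Map each pair of collection names to the Jaccard overlap of their URL sets."""
    return {
        (name_a, name_b): jaccard(collections[name_a], collections[name_b])
        for name_a, name_b in combinations(sorted(collections), 2)
    }


# Hypothetical URL sets standing in for three collections (not real data).
collections = {
    "archive_a": {"http://example.org/1", "http://example.org/2", "http://example.org/3"},
    "archive_b": {"http://example.org/3", "http://example.org/4"},
    "tweets": {"http://example.org/5"},
}

overlaps = pairwise_overlap(collections)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A low score for every pair, as in the datathon finding, would mean that each collector captured largely different material.
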
While the judges were deciding the winners, Ian wrapped up the event by looking back at the past two days and briefly mentioning the highlights of the event. He gave a vote of thanks to all individuals and sponsoring organizations who supported the event in various ways including data and computing resources, venue and logistics, and travel grants. The judges' verdict was in; Team Links, Team Robots, and Team Intersect were found guilty of being the best. Everyone was a winner, but some of them performed more efficiently than others within a very short span of time. I am sure every team had much more to show than what they could in the short five-minute presentation.

Now, it was time to disperse and continue exchanging ideas over drinks and dinner while getting ready for the rest of the Web Archiving Week events.

They: So, Sawood, are you planning to continue attending all future AU events?  
Me: I hope so! ;-)


--
Sawood Alam

Monday, June 26, 2017

2017-06-26: IIPC Web Archiving Conference (WAC) Trip Report

Mat Kelly reports on the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017 in London, England.

In the latter part of Web Archiving Week (#waweek2017) from Wednesday to Friday, Sawood and I attended the International Internet Preservation Consortium (IIPC) Web Archiving Conference (WAC) 2017, held jointly with the RESAW Conference at the Senate House and British Library Knowledge Center in London. Each of the three days had multiple tracks. Reported here are the presentations I attended.

Prior to the keynote, Jane Winters (@jfwinters) of University of London and Nicholas Taylor (@nullhandle) welcomed the crowd with admiration toward the Senate House venue. Leah Lievrouw (@Leah53) from UCLA then began the keynote. In her talk, she walked through the evolution of the Internet as a medium to access information prior to and since the Web.

With reservation toward the "Web 3.0" term, Leah described a new era in the shift from documents to conversations, to big data. With a focus toward the conference, Leah described the social science and cultural breakdown as it has applied to each Web era.

After the keynote, two concurrent presentation tracks proceeded. I attended a track where Jefferson Bailey (@jefferson_bail) presented "Advancing access and interface for research use of web archives". First citing an updated metric of the Internet Archive's holdings (see Ian's tweet below), Jefferson provided an update on some contemporary holdings and collections by IA inclusive of some of the details on his GifCities project (introduced with IA's 20th anniversary, see our celebration), which provides searchable access to the archive's holdings of the animated GIFs that once resided on Geocities.com.

In addition to this, Jefferson also highlighted the beta features of the Wayback Machine, inclusive of an anchor text-based search algorithm, MIME-type breakdown, and much more. He also described some other available APIs inclusive of one built on top of WAT files, a metadata format derived from WARC.

Through recent efforts by IA for their anniversary, they also had put together a collection of military PowerPoint slide decks.

Following Jefferson, Niels Brügger (@NielsBr) led a panel consisting of a subset of authors from the first issue of his journal, "Internet Histories". Marc Weber stated that the journal had been in the works for a while. When he initially told people he was looking at the history of the Web in the 1990s, people were puzzled. He went on to describe the Internet as being in its Victorian era, having evolved from 170 years of the telephone and 60 years of being connected through the medium. Of the vast history of the Internet we have preserved relatively little. He finished by noting that we need to treat history and preservation as something that should be done quickly, as we cannot go back later to find the materials if they are not preserved.

Steve Jones of the University of Illinois at Chicago spoke second about the Programmed Logic for Automatic Teaching Operations (PLATO) system. There were two key interests, he said, in developing for PLATO -- multiplayer games and communication. The original PLATO lab was in a large room and, out of laziness, the developers could not be bothered to walk to each other's desks, so they developed the "Talk" system to communicate and save messages so the same message would not have to be communicated twice. PLATO was not designed for lay users but for professionals, he said, but was also used by university and high school students. "You saw changes between developers and community values," he said, "seeing development of affordances in the context of the discourse of the developers that archived a set of discussions." Access to the PLATO system is still available.

Jane Winters presented third on the panel stating that there is a lot of archival content that has seen little research engagement. This may be due to continuing work on digitizing traditional texts but it is hard to engage with the history of the 21st century without engaging with the Web. The absence of metadata is another issue. "Our histories are almost inherently online", she said, "but they only gain any real permanence through preservation in Web archives. That's why humanists and historians really need to engage with them."

The tracks then joined together for lunch and split back into separate sessions, where I attended the presentation, "A temporal exploration of the composition of the UK Government Web Archive". In this presentation they examined the evolution of the UK National Archives (@uknatarchives). This was followed by a presentation by Caroline Nyvang (@caobilbao) of the Royal Danish Library that examined current web referencing practices. Her group proposed the persistent web identifier (PWID) format for referencing Web archives, which was eerily familiar to the URI semantics often used in another protocol.

Andrew (Andy) Jackson (@anjacks0n) then took the stage to discuss the UK Web Archive's (@UKWebArchive) catalog and challenges they have faced while considering the inclusion of Web archive material. He detailed a process, represented by a hierarchical diagram, to describe the sorts of transformations required in going from the data to reports and indexes about the data. In doing so, he also juxtaposed and compared his process with other archival workflows that would be performed in a conventional library catalog architecture.

Following Andy, Nicola Bingham (@NicolaJBingham) discussed curating collections at the UK Web Archive, which has been archiving since 2013, and challenges in determining the boundaries and scope of what should be collected. She encouraged researchers to engage to shape their collections. Their current holdings consist of about 400 terabytes with 11 to 12 billion records, growing 60 to 70 terabytes and 3 billion records per year. Their primary mission is to collect UK web sites under UK TLDs (like .uk, .scot, .cymru, etc.). Domains are currently capped at 512 megabytes preserved, but even then other technical limitations exist in capture (like proprietary formats, plugins, robots.txt, etc.).

When Nicola finished, there was a short break. Following that, I traveled upstairs in the Senate House to the "Data, process, and results" workshop, led by Emily Maemura (@emilymaemura). She first described three different research projects, each of whose researchers was present, and asked attendees to break out into groups to discuss the various facets of each project in detail with each researcher. I opted to discuss Federico Nanni's (@f_nanni) work with him and a group of other attendees. His work consisted of analyzing and resolving issues in the preservation of the web site of the University of Bologna. The site specifies a robots.txt exclusion, which makes the captures inaccessible to the public, but through his investigation and efforts, he was able to change their local policy to allow for further examination of the captures.
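
As a rough illustration of how a robots.txt exclusion of this kind is evaluated, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.edu` URLs, and the `is_excluded` helper are all hypothetical; I do not know what the University of Bologna's actual file contained, and `ia_archiver` is used only as an example crawler name.

```python
from urllib.robotparser import RobotFileParser


def is_excluded(robots_txt, user_agent, url):
    """Return True if the given robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one crawler is blocked entirely,
# everyone else is only kept out of /private/.
robots_txt = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

home = "http://www.example.edu/index.html"
private = "http://www.example.edu/private/report.html"

blocked_for_archiver = is_excluded(robots_txt, "ia_archiver", home)
blocked_for_other = is_excluded(robots_txt, "SomeOtherBot", home)
private_for_other = is_excluded(robots_txt, "SomeOtherBot", private)
print(blocked_for_archiver, blocked_for_other, private_for_other)
```

An archive that honors such a file, at crawl time or at replay time, would treat the whole site as off limits for the named crawler, which is why a change of policy at the site can open the captures up again.
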

With the completion of the workshop, everyone still in attendance joined back together in the Chancellor's Hall of the Senate House as Ian Milligan (@ianmilligan1) and Matthew Weber (@docmattweber) gave a wrap-up of the Archives Unleashed 4.0 Datathon, which had occurred prior to the conference on Monday and Tuesday. Part of the wrap-up was time given to the three top-ranked projects as determined by judges from the British Library. The group of which I was a part at the Datathon, "Team Intersection", was one of the three, so Jess Ogden (@jessogden) gave a summary presentation. More information on our intersection analysis between multiple data sets can be had on our GitHub.io page. A blog post with more details will be posted here in the coming days detailing our report of the Datathon.

Following the AU 4.0 wrap-up, the audience moved to the British Library Knowledge Center for a panel titled, "Web Archives: truth, lies and politics in the 21st century". I was unable to attend this, opting for further refinement of the two presentations I was to give on the second day of IIPC WAC 2017 (see below).

Day Two

The second day of the conference was split into three concurrent tracks -- two at the Senate House and a third at the British Library Knowledge Center. Given I was slated to give two presentations at the latter (and the venues were about 0.8 miles apart), I opted to attend the sessions at the BL.

Nicholas Taylor opened the session with the scope of the presentations for the day and introduced the first three presenters. First on the bill was Andy Jackson with "Digging document out of the web archives." He initially compared this talk to the one he had given the day prior (see above) relating to the workflows in cataloging items. In the second day's talk, he discussed the process of the Digital ePrints team and the inefficiencies of its manual process for ingesting new content. Based on this process, his team set up a new harvester that watches targets, extracts the document and machine-readable metadata from the targets, and submits it to the catalog. Still, issues remain, one being what to identify as the "publication" for e-prints relative to the landing page, assets, and what is actually cataloged. He discussed the need for further experimentation using a variety of workflows to optimize the outcome for quality and to ensure that the results are discoverable and accessible and that the process remains mostly automated.

Ian Milligan and Nick Ruest (@ruebot) followed Andy with their presentation on making their Canadian web archival data sets easier to use. "We want web archives to be used on page 150 in some book," they said, reinforcing that they want the archives to inform the insights instead of the subject necessarily being about the archives themselves. They also discussed their extraction and processing workflow, from acquiring the data from the Internet Archive to using Warcbase and other command-line tools to make the data contained within the archives more accessible. Nick said that since last year when they presented webarchives.ca, they have indexed 10 terabytes representative of over 200 million Solr docs. Ian also discussed derivative datasets they had produced inclusive of domain and URI counts, full-text, and graphs. Making the derivative data sets accessible and usable by researchers is a first step in their work being used on page 150.

Greg Wiedeman (@GregWiedeman) presented third in the technical session by first giving context of his work at the University at Albany (@ualbany) where they are required to preserve state records with no dedicated web archives staff. Some records have paper equivalents like archived copies of their Undergraduate Bulletins while digital versions might consist of Microsoft Word documents corresponding to the paper copies. They are using DACS to describe archives, so they questioned whether they should use it for Web archives. On a technical level, he runs a Python script to look at their collection of CDXs, which schedules a crawl that is displayed in their catalog as it completes. "Users need to understand where web archives come from," he says, "and need provenance to frame their research questions, which will add weight to their research."

A short break commenced, followed by Jefferson Bailey presenting, "Who, what when, where, why, WARC: new tools at the Internet Archive". Initially apologizing for repetition of his prior day's presentation, Jefferson went into some technical details of statistics IA has generated, APIs they have to offer, and new interfaces with media queries of a variety of sorts. They also have begun to use Simhash to identify dissimilarity between related documents.

I (Mat Kelly, @machawk1) presented next with "Archive What I See Now – Personal Web Archiving with WARCs". In this presentation I described the advancements we had made to WARCreate, WAIL, and Mink with support from the National Endowment for the Humanities, which we have reported on in a few prior blog posts. This presentation served as a wrap-up of new modes added to WARCreate, the evolution of WAIL (See Lipstick or Ham then Electric WAILs and Ham), and integration of Mink (#mink #mink #mink) with local Web archives. Slides below for your viewing pleasure.

Lozana Rossenova (@LozanaRossenova) and Ilya Kreymer (@IlyaKreymer) talked next about Webrecorder and namely about remote browsers. They showed a live example of viewing a web archive with a contemporary browser, where technologies that are no longer supported are not replayed as expected, often not being visible at all. Their work allows a user to replicate the original experience of the browser of the day to use the technologies as they were (e.g., Flash/Java applet rendering) for a more accurate portrayal of how the page existed at the time. This is particularly important for replicating art work that is dependent on these technologies to display. Ilya also described their Web Archiving Manifest (WAM) format to allow a collection of Web archives to be used in replaying Web pages with fetches performed at the time of replay. This patching technique allows for more accurate replication of the page at a time.

After Lozana and Ilya finished, the session broke for lunch then reconvened with Fernando Melo (@Fernando___Melo) describing their work at the publicly available Portuguese Web Archive. He showed their work building an image search of their archive using an API to describe Charlie Hebdo-related captures. His co-presenter João Nobre went into further details of the image search API, including the ability to parameterize the search by query string, timestamp, first-capture time, and whether it was "safe". Discussion from the audience afterward asked of the pair what their basis was of a "safe" image.

Nicholas Taylor spoke about recent work with LOCKSS and WASAPI and the re-architecting of the former to open the potential for further integration with other Web archiving technologies and tools. They recently built a service for bibliographic extraction of metadata for Web harvest and file transfer content, which can then be mapped to the DOM tree. They also performed further work on an audit and repair protocol to validate the integrity of distributed copies.

Jefferson again presented to discuss IMLS-funded APIs they are developing to test transfers using WASAPI to their partners. His group ran surveys to show that 15-20% of Archive-It users download their WARCs to be stored locally. Their WASAPI Data Transfer API returns a JSON object derived from the set of WARCs transferred inclusive of fields like pagination, count, requested URI, etc. Other fields representative of an Archive-It ID, checksums, and collection information are also present. Naomi Dushay (@ndushay) then showed a video of an overview of their deployment procedure.

After another short break, Jack Cushman & Ilya Kreymer tag-teamed to present, "Thinking like a hacker: Security Issues in Web Capture and Playback". Through a mock dialog, they discussed issues in securing Web archives and a suite of approaches challenging users to compromise a dummy archive. Ilya and Jack also iterated through various security problems that might arise in serving, storing, and accessing Web archives inclusive of stealing cookies, frame hijacking to display a false record, banner spoofing, etc.

Following Ilya and Jack, I (@machawk1, again) and David Dias (@daviddias) presented, "A Collaborative, Secure, and Private InterPlanetary WayBack Web Archiving System using IPFS". This presentation served as follow-on work from the InterPlanetary Wayback (ipwb) project Sawood (@ibnesayeed) had originally built at the Archives Unleashed 1.0 then presented at JCDL 2016, WADL 2016, and TPDL 2016. This work, in collaboration with David of Protocol Labs, who created the InterPlanetary File System (IPFS), was to display some advancements both in IPWB and IPFS. David began with an overview of IPFS, what problem it's trying to solve, its system of content addressing, and mechanism to facilitate object permanence. I discussed, as with previous presentations, IPWB's integration of web archive (WARC) files with IPFS using an indexing and replay system that utilizes the CDXJ format. One item in David's recent work is bringing IPFS to browsers with his JavaScript port, which interfaces with IPFS from the browser without the need for a running local IPFS daemon. I had recently introduced encryption and decryption of WARC content to IPWB, allowing for further permanence of archival Web data that may be sensitive in nature. To close the session, we performed a live demo of IPWB consisting of data replication of WARCs from another machine onto the presentation machine.

Following our presentation, Andy Jackson asked for feedback on the sessions and what IIPC can do to support the enthusiasm for open source and collaborative approaches. Discussions commenced among the attendees about how to optimize funding for events, with Jefferson Bailey reiterating that travel eats away at a large amount of the cost for such events. Further discussions were had about why the events were not recorded and on how to remodel the Hackathon events on the likes of other organizations like Mozilla's Global Sprints, the organization of events by the NodeJS community, and sponsoring developers for the Google Summer of Code. The audience then had further discussions on how to follow up and communicate once the day was over, inclusive of the IIPC Slack Channel and the IIPC GitHub organization. With that, the second day concluded.

Day 3

By Friday, with my presentations for the trip complete, I now had but one obligation for the conference and the week (other than write my dissertation, of course): to write the blog post you are reading. This was performed while preparing for JCDL 2017 in Toronto the following week (that I attended by proxy, post coming soon). I missed out on the morning sessions, unfortunately, but joined in to catch the end of João Gomes' (@jgomespt) presentation on Arquivo.pt, also presented the prior day. I was saddened to know that I had missed Martin Klein's (@mart1nkle1n) "Uniform Access to Raw Mementos" detailing his, Los Alamos', and ODU's recent collaborative work in extending Memento to support access to unmodified content, among other characteristics that cause a "Raw Memento" to be transformed. WS-DL's own Shawn Jones (@shawnmjones) has blogged about this on numerous occasions, see Mementos in the Raw and Take Two.

The first full session I was able to attend was Abbie Grotke's (@agrotke) presentation, "Oh my, how the archive has grown..." that detailed the progress and size that Library of Congress's Web archive has experienced with minimal growth in staff despite the substantial increase in size of their holdings. While captivated, I came to know via the conference Twitter stream that Martin's third presentation of the day coincided with Abbie's. Sorry, Martin.

I did manage to switch rooms to see Nicholas Taylor discuss using Web archives in legal cases. He stated that in some cases, social media used by courts may only exist in Web archives and that courts now accept archival web captures as evidence. The first instance of using IA's Wayback Machine was in 2004 and its use in courts has been contested many times to no avail. The Internet Archive provided affidavit guidance that suggested asking the court to ensure usage of the archive will consider captures as valid evidence. Nicholas alluded to FRE 201 that allows facts to be used as evidence, the basis for which the archive has been used. He also cited various cases where expert testimony of Web archives was used (Khoday v. Symantec Corp., et al.), a defamation case where the IA disclaimer dismissed using it as evidence (Judy Stabile v. Paul Smith Limited et al.), and others. Nicholas also cited WS-DL's own Scott Ainsworth's (@Galsondor) work on Temporal Coherence and how a composite memento may not have existed as displayed.

Following Nicholas, Anastasia Aizman and Matt Phillips (@this_phillips) presented "Instruments for Web archive comparison in Perma.cc". In their work with Harvard's Library Innovation Lab (with which WS-DL's Alex Nwala was recently a Summer fellow), the Perma team has a goal to allow users to cite things on the Web, create WARCs of those things, then be able to organize the captures. Their initial work with the Supreme Court corpus from 1996 to present found that 70% of the references had rotted. Anastasia asked, "How do we know when a web site has changed and how do we know which changes are important?"

They used a variety of ways to determine significant change inclusive of MinHash (via calculating the Jaccard Coefficients), Hamming Distance (via SimHash), and Sequence Matching using a Baseline. As a sample corpus, they took over 2,000 Washington Post articles consisting of over 12,000 resources, examined the SimHash and found big gaps. For MinHash, the distances appeared much closer. In their implementation, they show this to the user on Perma via their banner that provides an option to highlight file changes between sets of documents.
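
To give a feel for the three measures named above, here is a minimal sketch of each in plain Python. This is not Perma's implementation; the helper names, the word-shingle size, the number of MinHash functions, and the sample sentences are all assumptions made for illustration, and the baseline step of their sequence matching is not modeled.

```python
import hashlib
from difflib import SequenceMatcher


def shingles(text, k=3):
    """Return the set of k-word shingles in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash64(value, seed=0):
    """Deterministic 64-bit hash of a string; the seed selects the hash function."""
    digest = hashlib.blake2b(
        value.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash of the set under each seeded function."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_hashes)]


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of the Jaccard coefficient."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def simhash(items, bits=64):
    """SimHash fingerprint: per-bit majority vote over the hashes of all items."""
    counts = [0] * bits
    for item in items:
        h = _hash64(item)
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Two made-up versions of a page's text that differ by a single word.
old = "the quick brown fox jumps over the lazy dog near the river bank"
new = "the quick brown fox jumps over the lazy cat near the river bank"

a, b = shingles(old), shingles(new)
estimated_jaccard = minhash_similarity(minhash_signature(a), minhash_signature(b))
distance = hamming_distance(simhash(a), simhash(b))
ratio = SequenceMatcher(None, old, new).ratio()
print(estimated_jaccard, distance, ratio)
```

The three numbers are on different scales (an estimated similarity between 0 and 1, a count of differing bits out of 64, and a matching ratio between 0 and 1), so it is not surprising that the same pair of captures can look far apart under one measure and close under another.
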

There was a brief break, then I attended a session where Peter Webster (@pj_webster) and Chris Fryer (@C_Fryer) discussed their work with the UK Parliamentary Archives. Their recent work consists of capturing official social media feeds of the members of parliament, critical as it captures their relationship with the public. They sought to examine the patterns of use and access by the members and determine the level of understanding of the users of their archive. "Users are hard to find and engage", they said, citing that users were largely ignorant about what web archives are. In a second study, they found that users wanted a mechanism for discovery that mapped to an internal view of how the parliament functions. Their studies found many things from web archives that users do not want, but a takeaway is that they uncovered some issues in their assumptions and their study raised the profile of the Parliamentary Web Archives among their colleagues.

Emily Maemura and Nicholas Worby presented next with their discussion on origin studies as it relates to web archives, provenance, and trust. They examined decisions made in creating collections in Archive-It by the University of Toronto Libraries, namely the collections involving the Canadian political parties, the Toronto 2015 Pan Am Games, and their Global Summitry Archive. From these they determined that the defining traits of the three were being long running, covering a one-time event, and being a collaboratively created archive, respectively. For the candidates' sites, they also noticed the implementation of robots.txt exclusions in a supposed attempt to prevent the sites from being archived.

Alexis Antracoli and Jackie Dooley (@minniedw) presented next about their OCLC Research Library Partnership web archive working group. Their examination determined that discoverability was the primary issue for users. One such issue was their example of Princeton using Archive-It without that fact being documented. Through their study they established use cases for libraries, archives, and researchers. In doing so, they created a data dictionary of characteristics of archives inclusive of 14 data elements like Access/rights, Creator, Description, etc. with many fields having a direct mapping to Dublin Core.

After a short break, the final session began. I attended the session where Jane Winters (@jfwinters) spoke about increasing the visibility of web archives, asking first, "Who is the audience for Web archives?" then enumerating researchers in the arts, humanities and social sciences. She then described various examples in the press relating to web archives inclusive of a Computer Weekly report on Conservatives erasing official records of speeches from IA and Dr. Anat Ben-David's work on getting the .yu TLD restored in IA.

Cynthia Joyce then discussed her work in studying Hurricane Katrina's unsearchable archive. Because New Orleans was not a tech savvy place at the time and it was pre-Twitter, Facebook was young, etc., the personal record was not what it would be were the events to happen today. In her research as a citizen, she attempted to identify themes and stories that would have been missed in mainstream media. She said, "On Archive-It, you can find the Katrina collection ranging from resistance to gratitude." Only 8-9 years later did she collect the information, which many of the writers never expected to be preserved.

For the final presentation of the conference, Colin Post (@werrthe) discussed net-based art and how to go about making such works objects of art history. Colin used Alexei Shulgin's "Homework" as an example that uses pop-ups and self-conscious elements that add to the challenge of preservation. In Natalie Bookchin's course, Alexei Shulgin encouraged artists to turn in homework for grading, also doing so himself. His assignment is dominated by pop-ups, something we view in a different light today. "Archives do not capture the performative aspect of the piece", Colin said. He cited oldweb.today as providing interesting insights into how the page was captured over time, with multiple captures being combined. "When I view the whole piece, it is emulated and artificial; it is disintegrated and inauthentic."

Synopsis

The trip proved very valuable to my research. Not documented in this post was the time between sessions, when I was able to speak to some of the presenters about their work as it related to my own, and even to attendees who were not presenting, to find intersections in our respective research.

Mat (@machawk1)

Friday, July 6, 2012

2012-07-05: Exploring the WAC: Challenges in Providing Access to the World's Web Archives


The Web Archive Cooperative (WAC) held its 2012 Summer Workshop June 29–30 at Stanford University in Palo Alto, California. The workshop focused on the challenges (and some solutions) of providing easy access to the world’s web archives. The WS-DL Research Group had six members in attendance.

Memento and Source Code Repositories — Harihar Shankar (LANL) 

Memento allows temporal access to web resources using datetime. Version control services such as GitHub also allow temporal access, but using a version identifier instead of datetime. Harihar Shankar of the Los Alamos National Laboratory (LANL) Research Library presented Memento and a Memento/GitHub proxy prototyped at LANL. The proxy enables access to GitHub projects through datetime. For many use cases, datetime is much simpler than Git’s 40-hex-character commit id.
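
To make the idea concrete, here is a minimal sketch of how a datetime could be resolved to a commit. It is not the LANL proxy's code: the commit history, the abbreviated commit ids, and the `commit_for` helper are all hypothetical, and it assumes the rule "latest commit at or before the requested datetime". The only Memento-specific detail is that the request carries an HTTP-date in an `Accept-Datetime` header.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical commit history, sorted by commit time: (time, abbreviated id).
COMMITS = [
    (datetime(2012, 1, 10, 9, 0, tzinfo=timezone.utc), "a1b2c3d"),
    (datetime(2012, 3, 2, 15, 30, tzinfo=timezone.utc), "e4f5a6b"),
    (datetime(2012, 6, 20, 8, 45, tzinfo=timezone.utc), "c7d8e9f"),
]


def commit_for(accept_datetime, commits=COMMITS):
    """Return the id of the latest commit at or before the requested datetime.

    `accept_datetime` is an HTTP-date string, as sent in Accept-Datetime.
    Returns None when the request predates the first commit.
    """
    requested = parsedate_to_datetime(accept_datetime)
    times = [when for when, _ in commits]
    # bisect_right counts the commits whose time is <= requested.
    idx = bisect_right(times, requested)
    if idx == 0:
        return None
    return commits[idx - 1][1]


print(commit_for("Thu, 05 Apr 2012 12:00:00 GMT"))
```

With this rule the user only needs to know a date; the lookup hides the commit id entirely, which is the convenience the proxy is after.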

A Research Agenda for “Obsolete Data or Resources” — Michael Nelson (ODU)

Old Dominion University’s Michael Nelson presented WAC’s research agenda for obsolete data and resources. His presentation covered the public’s misconceptions about web archiving, where the web archiving community can improve, the origin of the current notion of time on the web, the gaps bridged by Memento, and some of the progress made to date. Many details and examples are available in the slides.

Building Full Text Indexes of Web Content using Open Source Tools — Erik Hetzner (California Digital Library)

Google knows how to index the Web and allow casual users to discover resources in mere seconds. Add time to the mix and current indexing and search solutions break down. Erik Hetzner described the challenges of, and approaches to, temporal indexing and search currently being addressed at the California Digital Library (CDL). CDL has 49 public archives, 19 partners, and nearly 1 billion URLs across archives of 3684 web sites. Nearly 50TB of archives must be stored, indexed, archived, and searched. CDL’s current solutions, which use NutchWAX, do not easily allow for deduplication, metadata indexing, and other optimizations. These and other architectural limitations motivated CDL to begin building anew.

“Tika + HDFS + Pig + Solr = weari”

CDL is now using a combination of open source products for its new WEb ARchiving Indexer (weari). Tika is used for text parsing, Hadoop and HDFS for scalability, Pig for data analysis, and Solr for search. Erik's slides are now available.
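
The deduplication problem mentioned above can be illustrated with a small sketch. This is not weari's design; the `DedupTemporalIndex` class, its method names, and the sample URLs are hypothetical. It assumes 14-digit capture timestamps (which sort correctly as strings) and indexes each distinct text once, keyed by its SHA-1 digest, while recording every capture that produced it.

```python
import hashlib
from collections import defaultdict


class DedupTemporalIndex:
    """Toy full-text index that stores identical content once per digest."""

    def __init__(self):
        self.text_by_digest = {}            # digest -> text (stored once)
        self.captures = defaultdict(list)   # digest -> [(url, timestamp)]
        self.postings = defaultdict(set)    # term -> {digest}

    def add(self, url, timestamp, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in self.text_by_digest:
            # First time this content is seen: tokenize and index it.
            self.text_by_digest[digest] = text
            for term in text.lower().split():
                self.postings[term].add(digest)
        # Every capture is recorded, even when the content is a duplicate.
        self.captures[digest].append((url, timestamp))
        return digest

    def search(self, term, start=None, end=None):
        """Return sorted (url, timestamp) captures containing `term` in range."""
        hits = []
        for digest in self.postings.get(term.lower(), ()):
            for url, ts in self.captures[digest]:
                if (start is None or ts >= start) and (end is None or ts <= end):
                    hits.append((url, ts))
        return sorted(hits)


index = DedupTemporalIndex()
index.add("http://example.org/", "20110101000000", "Hello archive")
index.add("http://example.org/", "20120101000000", "Hello archive")
index.add("http://example.org/", "20120601000000", "Hello web")
print(index.search("archive"))
```

Three captures are recorded here but only two texts are indexed, so an unchanged page crawled repeatedly costs one posting entry rather than one per crawl.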

Issues in Preserving Scientific and Scholarly Data in Web Archiving — Laura Wynholds (UCLA)

Laura Wynholds studies scientists and what they do with their data. She has been working with scientists from the Center for Embedded Network Sensing (CENS) and Sloan Digital Sky Survey. At both, she has found a variety of data lifecycles and standards. She has found that data and its associated documentation are shared in many ways, from formal institutional stewardship and repositories to informal means such as email, FedEx, and web sites. Large, well-used data sets tend to have very good preservation arrangements. Medium and small data sets do not. However, many medium and small data sets are shared on the web and could be subject to web archiving. The web archive status of two data sets (The VLA FIRST Survey and COMPLETE) was assessed. Neither was well-represented in public web archives. Those data that were archived were not in formats required by scientists (e.g. low-resolution images). So, web archiving can preserve scientific data, but changes in selection criteria are required for web archiving to be truly effective.

Whose Content is it Anyway? User Perspectives on Archiving Social Media — Cathy Marshall (Microsoft Research)

Cathy Marshall presented her current findings on the public’s views on ownership and reuse of visual media. In the web archiving community we feel the need to preserve the historical Web just as libraries have traditionally preserved copies of books, newspapers, and magazines. Cathy’s research addresses the social issues with which we in the web archiving community must contend. Many photographs, blogs, and tweets are publicly accessible on the web, which makes archiving them technically simple. However, when people learn that their pictures and posts are being archived, they are frequently surprised and upset by the fact. This is especially true if the archiving organization is a government entity such as the Library of Congress. Much of Cathy’s presentation is covered in detail in her JCDL’12 paper “On the institutional archiving of social media”.

Panel: Legal Opportunities for Web Archiving — Kathy Hashimoto and David Hansen (Berkeley Digital Library Copyright Project)

Another important consideration for web archivists is copyright. The “Legal Opportunities for Web Archiving” panel discussion focused on approaches to ensure web archiving is and remains free of legal burden and litigation. In the United States, copyright is derived from Article I, Section 8 of the Constitution and USC Title 17, Chapter 1. There are legal opportunities for web archiving in § 107 (Fair Use), § 108 (Libraries and Archives), § 109 (“First Sale” Doctrine), and § 110 (Non-profit performances). The panel discussed the structure of copyright and the issues and problems with copyright in the web archiving context. More information is available on the Berkeley Digital Library Copyright Project web site.

ArcSpread: Familiar Concepts Towards Archive Analytics for Social Scientists — Andreas Paepcke (Stanford)

Web archives have been collecting information for nearly two decades, but making this information easily accessible to non-Computer Scientists continues to be a challenge. Andreas Paepcke is working with social scientists to build tools that allow high-level interaction with archives. The ArcSpread tool (Narrated demo) uses the Stanford WebBase as its data source. A spreadsheet metaphor provides a working environment familiar to most computer users.


Text-Entity-Time Analytics in a Temporal Coherent Web Archive — Marc Spaniol (LAWA Project)

Marc Spaniol is a member of the Longitudinal Analytics of Web Archive Data (LAWA) project where he studies temporal aspects of Web evolution. A detailed description is presented in "Tracking entities in web archives: the LAWA project". Web archives are a gold mine of information, but we lack effective mining tools. Currently, entity tracking is a labor-intensive and tedious process. The relevant URIs must be known and web archive searching is notoriously difficult. Additionally, following web archive links creates time diffusion and web archive crawls suffer from temporal incoherence. Text-Entity-Time Analytics focuses on tracking entities (people, places, etc.) over time. The AIDA framework is an online tool for entity detection and disambiguation. Measuring temporal incoherence is key to understanding the sources of incoherence. Spaniol has developed the SHARC framework that allows incoherence measurement and demonstrated that simple changes to crawling strategies will improve temporal coherence.
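
As a rough illustration of what measuring incoherence can mean, the sketch below flags pages whose captured version differs from the version that existed at a chosen reference time. It is not SHARC's metric: the `incoherent_pages` function, the integer timestamps, and the assumption that each page's change times are known are all hypothetical simplifications.

```python
def incoherent_pages(captures, changes, reference):
    """Return URLs whose captured version differs from the one at `reference`.

    captures: url -> capture time
    changes:  url -> list of times at which the page changed
    A page is incoherent if it changed between its capture and the reference
    time, in either direction.
    """
    stale = []
    for url, captured in captures.items():
        lo, hi = sorted((captured, reference))
        # A change strictly after the earlier time and up to the later time
        # means the capture and the reference state are different versions.
        if any(lo < t <= hi for t in changes.get(url, [])):
            stale.append(url)
    return sorted(stale)


# Hypothetical crawl: times are seconds since the crawl started.
captures = {"a": 10, "b": 20, "c": 30}
changes = {"a": [15], "b": [5], "c": [25, 40]}
stale = incoherent_pages(captures, changes, reference=20)
print(stale, len(stale) / len(captures))
```

Under this toy definition, visiting the frequently changing pages closest to the reference time lowers the count, which hints at why crawl ordering matters for coherence.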

Archiving Web Pages with Hadoop and Pig — Aaron Binns (Internet Archive)

The Internet Archive (IA) currently holds over 176,000,000,000 resources that require nearly 3 petabytes of storage, held as Web Archive (WARC), CDX, and Web Archive Transformation (WAT) files. IA processes this mass of resources using Hadoop and Pig. The problem definition, big data description, and architectural overview presented by Binns were excellent. The slides contain many more details and are well worth a look even without Aaron’s live explanation.

Beyond BigData: Challenges for Facebook’s Data Infrastructure – Sameet Agarwal (Facebook)

When it comes to big data, few would dispute that Facebook has more data to crunch than nearly anyone else. Sameet Agarwal manages Facebook’s 100PB (yes, petabyte!) Hadoop cluster, the largest Hadoop cluster in the world. Facebook’s needs have driven it to contribute to Hadoop and to lead the development of Hive, a peta-scale data warehouse based on Hadoop. This data warehouse has been the source for several interesting studies, including the recently-publicized reduction of six degrees of separation to four (actually 4.74). While a 100PB Hadoop cluster may seem like a problem solved, many issues still need research and resolution. How to keep a 100PB cluster running? How to fairly allocate resources to multiple tenants? How to coordinate multiple clusters? How to coordinate multiple, geographically dispersed clusters? Currently, log data from www.facebook.com is delivered overnight. How can this latency be reduced or eliminated? Facebook’s data is naturally a graph. Is a set of tables the best way to represent the data? Is converting graph data into a set of map-reduce jobs the right approach?

Acknowledgements

Many thanks to Frank McCown (Harding University) for organizing the workshop, Andreas Paepcke and Hector Garcia-Molina at Stanford for hosting, the National Science Foundation (NSF) for their support (1009916), and especially Marianne Siroker for all the time and effort she put into the food and facilities arrangements.


— Scott G. Ainsworth

Monday, October 4, 2010

2010-10-04: WAC Kickoff Meeting; LC Storage Architectures Meeting, DPC Award Shortlist

On September 24, I attended the kickoff meeting at Stanford for the Web Archiving Cooperative (WAC) Project, a joint NSF project (~$2.8M) between Stanford, Old Dominion and Harding. A summary of the meeting will be published at a later date, but it was attended by several members of our Advisory Board (from memory: Chris Borgman (UCLA), Trisha Cruse (CDL), Rick Furuta (TAMU), Alon Halevy (Google), Carl Lagoze (Cornell), Raghu Ramakrishnan (Yahoo), Herbert Van de Sompel (LANL)) and several members and friends of the Stanford Infolab.

I gave two presentations: the first was a quick review of the state of web preservation (with the obligatory heavy emphasis on Memento), and the second was some of my ruminations about future things that we should (or should not) explore in the context of WAC.





That night I caught a redeye back to Norfolk so I could be in DC the following Monday for the Library of Congress Designing Storage Architectures for Preservation Collections Meeting. While I believe this is their fourth such meeting, it is the first one I attended and while (because?) I did not present or speak, I learned a great deal. The meeting featured a good mix of academicians and storage industry leaders discussing very large scale storage architectures -- scales that we don't typically approach in our research at ODU. The majority of the presentations were limited to 5 minutes each, so a good breadth of topics was covered and perusing the slides will be worth your time.

Finally, Memento has been named one of five finalists for the Digital Preservation Coalition 2010 Digital Preservation Award. It is an honor to be a finalist amongst the other projects (see the DPC Press Release for descriptions of all the projects). The Library of Congress and ODU have also issued press releases. The final announcement will come in December -- here's hoping Memento can bring in the prize.

--Michael