
Monday, June 11, 2018

2018-06-11: Web Archive and Digital Libraries (WADL) Workshop Trip Report from JCDL2018

Mat Kelly reports on the Web Archiving and Digital Libraries (WADL) Workshop 2018 that occurred in Fort Worth, Texas.


On June 6, 2018, after attending JCDL 2018 (trip report), WS-DL members attended the Web Archiving and Digital Libraries 2018 Workshop (#wadl2018) in Fort Worth, Texas (see trip reports from WADL 2017, 2016, 2015, 2013). WS-DL's contributions to the workshop included multiple presentations, including the workshop keynote by my PhD advisor, which I discuss below.

The Project Panel

Martin Klein (@mart1nkle1n) initially welcomed the workshop attendees and had the group of 26-or-so participants give a quick overview of who they were and their interest in attending. He then introduced Zhiwu Xie (@zxie) of Virginia Tech to begin the series of presentations reporting on the kickoff of the IMLS-funded project (as established at WADL 2017) "Continuing Education to Advance Web Archiving". A distinguishing feature of this project compared to others, Zhiwu said, is that it will use project-based problem solving rather than producing only surveys and lectures. He highlighted a collection of curriculum modules that apply existing practice (event archiving) to various Web archiving tools (e.g., Social Feed Manager (SFM), ArchiveSpark, and the Archives Unleashed Toolkit) to build understanding of the fundamentals (e.g., the Web, data science, big data) and produce hands-on experience in libraries, archives, and programming. The focus is on individuals who already have some experience with archives rather than on training for those with none.

ODU WS-DL's Michael Nelson (@phonedude_mln) continued, noting that one motivation is to encourage storytelling using Web archives, an effort hampered by the recent closing of Storify. Some recent work of the group (including the in-development project MementoEmbed) would allow this concept to be revitalized despite Storify's demise through systematic "card" generation for mementos, allowing a more persistent (in the preservation sense) version of a story to be extracted and retained.

Justin Littman (@justin_littman) of George Washington University Libraries continued the project description by describing Social Feed Manager and emphasizing that what you get from the Twitter API may well differ from what you get from the Web interface. The purpose of SFM is to be an easy-to-use, self-service Web interface that drives down the barriers to collecting social media data for academic research.

Ian Milligan (@ianmilligan1) continued by giving a quick run-down of his group's Archives Unleashed projects, noting a realization in the projects' development that not all historians like working with the command line and Scala. He then briefly described the projects' filter-analyze-aggregate-visualize approach to making large collections of Web archives more usable for research.

Wrapping up the project report, Ed Fox described Virginia Tech's initial attempts at performing crawls with Heritrix via Archive-It and how noisy the results were. He emphasized that a typical crawling approach consisting of starting with seed URIs harvested from tweets does not work well. The event model his group is developing and further evaluating will help guide the crawling procedure.

Ed's presentation completed the series of reports for the IMLS project panel, and a series of individual presentations followed.

Individual Presentations

John Berlin (@johnaberlin) started off with an abbreviated version of his Master's Thesis titled, "Swimming In A Sea Of JavaScript, Or: How I Learned To Stop Worrying And Love High-Fidelity Replay". While John had recently given his defense in April (see his post for more details), this presentation focused on some of the more problematic aspects of archival replay as caused by JavaScript. He highlighted specific instances where the efforts of a replay system to accurately replay JavaScript varied from causing a page to display a completely blank viewport (see CNN.com has been unarchivable since November 1st, 2016) to the representation being hijacked to declare Brian Williams as the originator of "Gin and Juice" long before Snoop Dogg(y Dogg). John has created a Chrome and Firefox extension he dubbed "Wayback Plus Plus" that mitigates JavaScript-based replay issues using client-side redirects. See his presentation for more details.

The workshop participants then broke to grab a boxed lunch, after which Ed Fox returned to present "A Study of Historical Short URLs in Event Collections of Tweets". In this work Ed highlighted the number of tweets in their collections that contained URLs, namely that 10% had 2 URLs and less than 0.5% had 3 or more. From this collection, his group analyzed how many of the linked URLs are still accessible in the Internet Archive's Wayback Machine, emphasizing that the Wayback Machine does not cover much of what appears in the Twitter data he has gathered. His group also analyzed the time difference between when a tweet with URLs was posted and when the URLs were archived and found that 50% were archived within 5 days of the tweet.

Keynote

The workshop keynote, "Enabling Personal Use of Web Archives", was next, presented by my PhD advisor Dr. Michele C. Weigle (@weiglemc). Her presentation first gave a high-level overview of the needs of those who want to perform personal Web archiving and the tools that the WS-DL group has created over the years to address those needs. She highlighted the group's early work in identifying disasters in existing archives and segued into the observation that many archive users are unaware that there are archives beyond the Internet Archive.

As part of the group's tooling to encourage Web users to Archive What They See Now, they created the WARCreate Chrome extension to create WARC files from any Web page. To address the question of what a user should do with their WARCs, they then created the Web Archiving Integration Layer (WAIL) (and later an Electron version) to allow individuals to control both the preservation and replay processes. To give users a better picture of the archived Web as they browse, they created the Chrome extension Mink, which reports how well archived (in terms of quantity of mementos) a URI is as the user browses the live Web and optionally (and easily) submits the currently viewed URI to between one and three Web archives.

Dr. Weigle also highlighted the work of other past and present WS-DL students, such as Yasmin Anwar's (@yasmina_anwar) Dark and Stormy Archives (DSA) and Shawn Jones' (@shawnmjones) upcoming MementoEmbed tool.

Following the tool review, Dr. Weigle asked, "What if browsers could natively interpret and replay WARCs?" She gave a high-level review of what could be possible if the compatibility barriers between the archived and live Web were resolved through live-Web tools that could natively interact with the archived Web. In one example, she provided a screenshot where, in place of the "secure" badge a browser provides, the browser would also be aware that it is viewing an archived page and indicate as much.

Libby Hemphill (@libbyh) presented next with "Developing a Social Media Archive at ICPSR", where her group seeks to make data useful for people who want to understand how we live today from the perspective of the long-distant future. She mentioned how messy it can be to work through the ethical challenges of archiving social media data and that people have different levels of comfort depending on the sort of research for which their social media content is to be used. She outlined the architecture of their social media archive, SOMAR: federating data to follow the terms of service, rehydrating tweets for research use, and other aspects of the social-media-to-research-data process.

The workshop then took another break with a simultaneous poster session, including a poster by Justin Littman titled "Supporting social media research at scale" and one by WS-DL's Sawood Alam (@ibnesayeed), "A Survey of Archival Replay Banners". Just prior to their poster presentations, each gave a lightning talk as a quick overview to entice attendees to stop by.

After the break, WS-DL's Mohamed Aturban (@maturban1) presented "It is Hard to Compute Fixity on Archived Web Pages". Mohamed's work highlighted that subtle changes in content may be difficult to detect using conventional hashing methods to compute the fixity of Web pages. He emphasized that computing the fixity of the root HTML page of a memento is not sufficient; the fixity of all embedded resources must also be computed. With an approach utilizing Merkle trees (see also the Wikipedia entry), he generates a hash of the composite memento representative of the fixity of all embedded resources. In one example highlighted in his recent post and tech report, Mohamed showed the manipulation of Climate Change data.
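
As a rough illustration of the idea (not Mohamed's implementation, and with the Merkle tree flattened to a single level), the sketch below hashes each embedded resource of a composite memento and then hashes the sorted concatenation of those hashes into one fixity value; the URI-Ms are placeholders.

```typescript
// Sketch only: compute a single fixity value for a composite memento by
// hashing every embedded resource and then hashing the combined leaf hashes
// (a flattened, two-level version of the Merkle-tree idea). The URI-Ms below
// are placeholders, not actual archived resources.
import { createHash } from "crypto";

function sha256(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

async function compositeFixity(uriMs: string[]): Promise<string> {
  const leafHashes: string[] = [];
  for (const uriM of uriMs) {
    const resp = await fetch(uriM);                     // dereference the memento
    leafHashes.push(sha256(new Uint8Array(await resp.arrayBuffer())));
  }
  // Sort so the result does not depend on discovery order, then hash the
  // concatenation; a change in any embedded resource changes this value.
  return sha256(leafHashes.sort().join(""));
}

compositeFixity([
  "https://web.archive.org/web/20170101000000/https://example.com/",
  "https://web.archive.org/web/20170101000000/https://example.com/logo.png",
]).then((h) => console.log("composite fixity:", h));
```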

To wrap up the presentations for the workshop, I (Mat Kelly, @machawk1) presented "Client-Assisted Memento Aggregation Using the Prefer Header". This work highlighted one particular aspect of my presentation from the previous day at JCDL 2018 (see blog post), namely how the framework from that presentation facilitates specifying which archives are aggregated using Memento. A previous investigation by Jones, Van de Sompel et al. (see "Mementos in the Raw, Take Two") used the HTTP Prefer header to allow a client to request the un-rewritten version of mementos from an archival replay system. In my work, I imagined a more capable Memento aggregator that would expose the archives aggregated and allow a client, basing its customizations on the aggregator's response, to customize the set of archives aggregated by sending that set as base64-encoded data in the Prefer request header.
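
To make the idea concrete, here is a hypothetical client-side sketch: the aggregator URL, the preference token ("archives"), and the encoding of the archive list are illustrative stand-ins for the design described in the paper, not its actual interface.

```typescript
// Hypothetical client: request a TimeMap from a Memento aggregator while
// expressing which archives to aggregate, sent base64-encoded in the HTTP
// Prefer header. The endpoint and preference token are illustrative only.
async function customAggregation(targetUri: string): Promise<string> {
  const archives = ["https://web.archive.org/web/", "https://archive.today/"];
  const encoded = Buffer.from(JSON.stringify(archives)).toString("base64");

  const resp = await fetch(
    `https://aggregator.example.org/timemap/link/${targetUri}`,
    { headers: { Prefer: `archives="${encoded}"` } }
  );
  // Per RFC 7240, a server echoes honored preferences in Preference-Applied,
  // letting the client verify which archives were actually used.
  console.log("Preference-Applied:", resp.headers.get("Preference-Applied"));
  return resp.text();
}

customAggregation("https://example.com/").then(console.log);
```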

Closing

When I finished the final presentation, Ed Fox began the wrap-up of the workshop. This all-attendee discussion opened the floor for comments and recommendations for the future of the workshop. With the discussion finished, the workshop came to a close. As usual, I found this workshop extremely informative, even though I was familiar with much of the participants' previous work. I hope, as other attendees also expressed, to encourage other fields to become involved and present their ongoing work and ideas at this informal workshop. Doing so, from the perspective of both an attendee and a presenter, has proven valuable.

Mat (@machawk1)

Wednesday, July 5, 2017

2017-07-04: Web Archiving and Digital Libraries (WADL) Workshop Trip Report From JCDL2017


The Web Archiving and Digital Libraries (WADL) Workshop was held after JCDL 2017, from June 22 to June 23, 2017. I live-tweeted both days; you can follow along with this blog post on Twitter using the hashtag #wadl2017 or via the notes/minutes of WADL2017. I also created a Twitter list of the speakers' and presenters' handles; give them a follow to keep up to date with their exciting work.

Day 1 (June 22)

WADL2017 kicked off at 2 pm with Martin Klein and Edward Fox welcoming us to the event by giving an overview and introduction to the presenters and panelists.

Keynote

The opening keynote of WADL2017 was National Digital Platform (NDP), Funding Opportunities, and Examples Of Currently Funded Projects by Ashley Sands (IMLS).
In the keynote, Sands spoke about the desired values for the national digital platform, the various grant categories and funding opportunities IMLS offers for archiving projects, and the grant submission procedure, along with tips for writing IMLS grant proposals. Sands also shared what a successful (funded) proposal looks like and how to apply to become a proposal reviewer!

Lightning Talks

First up in the lightning talks was Ross Spencer from the New Zealand Web Archive on "HTTPreserve: Auditing Document-Based Hyperlinks" (poster).

Spencer has created a tool (httpreserve) that checks the status of a URL on the live web and whether it has been archived by the Internet Archive; it is part of a larger suite of tools under the same name. You can try it out via httpreserve.info, and the project is open to contributions from the community as well!
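
A rough sketch of the kind of check the tool performs (this is not httpreserve's own code): test whether a URL still resolves on the live web and query the Internet Archive's public availability API for the closest capture.

```typescript
// Sketch of an httpreserve-style link audit: live-web status plus whether the
// Internet Archive holds a capture, via IA's public availability API.
async function auditLink(url: string): Promise<void> {
  let liveStatus = "unreachable";
  try {
    const live = await fetch(url, { method: "HEAD", redirect: "follow" });
    liveStatus = String(live.status);
  } catch {
    // network failure: leave liveStatus as "unreachable"
  }

  const api = "https://archive.org/wayback/available?url=" + encodeURIComponent(url);
  const data = await (await fetch(api)).json();
  const closest = data?.archived_snapshots?.closest;   // absent if never archived

  console.log(`${url} -> live: ${liveStatus}, archived: ${closest ? closest.url : "no"}`);
}

auditLink("http://example.com/some/old/page.html");
```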
The second talk was by Muhammad Umar Qasim on "WARC-Portal: A Tool for Exploring the Past". WARC-Portal is a tool that seeks to provide access for researchers to browse and search through custom collections and provides tools for analyzing these collections via Warcbase.
The third talk was by Sawood Alam on "The Impact of URI Canonicalization on Memento Count". Alam spoke about the ratio of representations vs. redirects obtained from dereferencing each archived capture. For a more detailed explanation, you can read our blog post or the full technical report.

The final talk was by Edward Fox on "Web Archiving Through In-Memory Page Cache". Fox spoke about nearline vs. transactional Web archiving and the advantages of using a Redis cache.

Paper Sessions

First up in the paper sessions were Ian Milligan, Nick Ruest, and Ryan Deschamps with "Building a National Web Archiving Collaborative Platform: Web Archives for Longitudinal Knowledge Project".
The WALK project seeks to address the issue of "To use Canadian web archives you have to really want to use them, that is you need to be an expert" by "Bringing Canadian web archives into a centralised portal with access to derivative datasets".
Enter WALK: 61 collections, 16 TB of WARC files, and a newly developed Solr front end based on Project Blacklight (250 million records indexed so far). The WALK workflow uses Warcbase and a handful of other command-line tools to retrieve data from the Internet Archive, automatically generate scholarly derivatives (visualizations, etc.), upload those derivatives to Dataverse, and ensure the derivatives are available to the research team.
To ensure that WALK can scale, the project will build on top of Blacklight and contribute the work back to the community as WARCLight.
The second paper presentation of WADL2017 was by Sawood Alam on "Avoiding Zombies in Archival Replay Using ServiceWorker". Alam spoke about how, through the use of ServiceWorkers, URIs that were missed during rewriting (or not rewritten at all due to the dynamic nature of the web) can be rerouted dynamically by the ServiceWorker to hit the archive rather than the live web.
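
The gist of the approach, as a minimal sketch (the replay prefix is a placeholder and this is not the code from the paper): a ServiceWorker's fetch handler intercepts every request made during replay and reroutes anything that escaped rewriting back into the archive.

```typescript
// sw.ts -- minimal sketch of zombie avoidance during archival replay.
// Compile with the "webworker" lib; the replay prefix is a placeholder.
const REPLAY_PREFIX = "https://archive.example.org/memento/20170623000000/";

self.addEventListener("fetch", (event) => {
  const fe = event as FetchEvent;
  const url = fe.request.url;
  // Already rewritten to the archive: let the request proceed untouched.
  if (url.startsWith(REPLAY_PREFIX)) return;
  // Otherwise it is a leak to the live web (a "zombie"): reroute it so the
  // response is served from the archive rather than the live site.
  fe.respondWith(fetch(REPLAY_PREFIX + url));
});
```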

Ian Milligan was up next presenting "Topic Shifts Between Two US Presidential Administrations". One of the biggest questions Milligan noted during his talk was how to train a classifier when there is no annotated data to train it with. To address this, Milligan bootstrapped the process using bag-of-words and keyword matching, noting that this method works with noisy but reasonable data. The classifiers were trained to look for biases between administrations; Trump vs. Obama shows dramatic differences, and the TL;DR is that the classifiers do learn the biases. For more detailed information about the paper, see Milligan's blog post about it.
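
As a toy illustration of that bootstrapping step (the keyword lists here are invented for the example, not taken from the paper), keyword matching assigns provisional labels to otherwise unannotated tweets, which can then seed classifier training:

```typescript
// Toy bootstrap labeler: assign a provisional label to an unannotated tweet
// via keyword matching. The keyword lists are illustrative placeholders.
const SEED_KEYWORDS: Record<string, string[]> = {
  obama_era: ["obama", "aca", "paris agreement"],
  trump_era: ["trump", "maga", "travel ban"],
};

function bootstrapLabel(tweet: string): string | null {
  const text = tweet.toLowerCase();
  for (const [label, words] of Object.entries(SEED_KEYWORDS)) {
    if (words.some((w) => text.includes(w))) return label; // noisy but usable seed label
  }
  return null; // remains unlabeled and is excluded from the seed set
}

console.log(bootstrapLabel("Senate debates the travel ban today")); // "trump_era"
```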
Closing the first day of WADL2017 was Brenda Reyes Ayala with the final paper presentation, "Web Archives: A Preliminary Exploration of User Expectations vs. Reality". Ayala examined Archive-It support tickets (exported as XML), which were cleaned and anonymized and then analyzed using qualitative coding and grounded theory. She presented users' expectations of web archives and contrasted each with reality:
Expectation: The original website had X number of documents, so it would follow that the archived website also has X number of documents.
Reality: An archived website was often much larger or smaller than the user had expected.
Expectation: A web archive only includes content that is closely related to the topic.
Reality: Due to crawler settings, scoping rules, and the nature of the web, web archives often include content that is not topic-specific. This was especially the case with social media sites. Users saw the presence of this content as being of little relevance and superfluous.
Expectation: Content that looks irrelevant is actually irrelevant.
Reality: A website contains pages or elements that are not obviously important but help "behind the scenes" to make other elements or pages render correctly or function properly. This is knowledge known to the partner specialist but usually unknown or invisible to the user or creator of an archive; partner specialists often had to explain the true nature of this seemingly irrelevant content.
Expectation: Domains and sub-domains are the same thing, and they do not affect the capture of a website.
Reality: These differences usually affect how a website is captured.
2017-08-25 edit: Slides accompanying Ayala's talk were made available: "Web archives: A preliminary exploration of user expectations vs. reality", hosted by The Portal to Texas History.

Day 2 (June 23)

Day two started off with a panel featuring Emily Maemura, Dawn Walker, Matt Price, and Maya Anjur-Dietrich on "Challenges for Grassroots Web Archiving of Environmental Data". The first event they hosted took place in December in Toronto to preserve EPA data from the Obama administration during the Trump transition. The event had roughly two hundred participants and produced hundreds of press articles, tens of thousands of URLs seeded to the Internet Archive, dozens of coders building tools, and a sustainable local community of activists interested in continuing the work. Since then, seven events in Philly, NYC, Ann Arbor, Cambridge MA, Austin TX, and Berkeley have been hosted or co-hosted, with thirty-one more planned in cities across the country.
After the panel was Tom J. Smyth on legal deposit, collection development, preservation, and web archiving at Library and Archives Canada. Smyth spoke about how to start building a collection for a budding web archive that does not yet have the scale of an established one, and outlined the following:
Web Archival Scoping Documents
  • What priority
  • What type
  • What are we trying to document
  • What degree are we trying to document
Controlled Collection Metadata, Controlled vocabulary
  • Evolves over time with the collection topic
Quality Control Framework
  • Essential for setting a cut-off point for quality control
Selected Web Resources must pass four checkpoints
  • Is the resource in-scope of the collection and theme
    (when in doubt consult the Scoping Document)
  • Heritage Value: is the content unique or available in other formats
    (in what contexts can it be used)
  • Technology / Preservation
  • Quality Control

The next paper presenters up were Muhammad Umar Qasim and Sam-Chin Li for "Working Together Toward a Shared Vision: Canadian Government Information Digital Preservation Network (CGI-DPN)". The Canadian Government Information Digital Preservation Network (CGI-DPN) is a project that seeks to preserve digital collections of government information and ensure the long-term viability of digital materials through geographically dispersed servers, protective measures against data loss, and forward format migration. The project will also act as a backup server in cases where the main server is unavailable, as well as a means of restoring lost data. To achieve these goals, the project uses Archive-It for the web crawls and collection building and then uses LOCKSS to disseminate the collections to additional peers (LOCKSS nodes).
Nick Ruest was up next speaking on "Strategies for Collecting, Processing, and Analyzing Tweets from Large Newsworthy Events". Ruest spoke about how Twitter is big data and handling it can be difficult, and about how to handle big Twitter data in a sane manner using tools such as Hydrator or twarc from the DocNow project.


The final paper presentation of the day was by Saurabh Chakravarty, Eric Williamson, and Edward Fox on "Classification of Tweets using Augmented Training". Chakravarty discussed using the cosine similarity measure on Word2Vec-based vector representations of tweets and how it can be used to label unlabeled examples. Training a classifier with this augmented training provides improvements in classification efficacy, and a Word2Vec representation generated from a richer corpus such as Google News provides even better improvements with augmented training.
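
A rough sketch of the augmentation step (the vectors stand in for Word2Vec representations, and the threshold and structure are assumptions, not the paper's exact procedure): each unlabeled tweet adopts the label of its most cosine-similar labeled tweet, provided the similarity is high enough.

```typescript
// Sketch: grow a training set by labeling each unlabeled tweet with the label
// of its most cosine-similar labeled tweet. Vectors stand in for Word2Vec
// representations; the 0.7 threshold is an arbitrary assumption.
type Example = { vec: number[]; label?: string };

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (x: number[]) => Math.sqrt(x.reduce((s, v) => s + v * v, 0));
  return dot / (norm(a) * norm(b));
}

function augmentTrainingSet(labeled: Example[], unlabeled: Example[], minSim = 0.7): Example[] {
  const augmented = [...labeled];
  for (const u of unlabeled) {
    let bestSim = -1;
    let bestLabel: string | undefined;
    for (const l of labeled) {
      const sim = cosine(u.vec, l.vec);
      if (sim > bestSim) { bestSim = sim; bestLabel = l.label; }
    }
    // Only accept confident matches into the augmented training set.
    if (bestSim >= minSim) augmented.push({ vec: u.vec, label: bestLabel });
  }
  return augmented;
}
```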

Closing Round Table On WADL

The final order of business for WADL 2017 was a round table discussion with the participants and attendees concerning next year's WADL and how to make WADL even better. There were a lot of great ideas and suggestions made as the round table progressed, with the participants becoming most excited about the following:
  1. WADL 2018 (naturally of course)
  2. Seeking out additional collaboration and information sharing with those who are actively working in web archiving but are unaware of, or did not make it to, WADL
  3. Looking into bringing proceedings to WADL, perhaps even a journal
  4. Extending the length of WADL to a full two- or three-day event
  5. Integration of remote participation for those who wish to attend but cannot due to geographical location or travel expenses
Until the Joint Conference on Digital Libraries 2018, June 3-7 in Fort Worth, Texas, USA!
- John Berlin

Friday, June 24, 2016

2016-06-24: Web Archive and Digital Libraries (WADL) Workshop Trip Report from JCDL2016

Trip Report for the Web Archiving and Digital Libraries Workshop 2016 in Newark, NJ.                           

Following our recent trip to the Joint Conference on Digital Libraries (JCDL), WS-DL attended the Web Archiving and Digital Libraries (WADL) Workshop (#wadl2016) (see the 2015 and 2013 reports), co-located with the conference in Newark, New Jersey. This post documents the presentations and our take-homes from the workshop.

On Wednesday, June 22, 2016, at around 2pm, Ed Fox of Virginia Tech welcomed the crowd of 30+ librarians, researchers, and archivists to the two half-day event by first introducing Vinay Goel (@vinaygo) of the Internet Archive (IA) to lead the panel on "Worldwide Activities on Web Archiving".

Vinay described recent efforts by IA to make the archives more accessible through additional interfaces to the archive's holdings. As at the Archives Unleashed 2.0 Datathon earlier this month, Vinay demoed the then restricted-access beta version of the Wayback Machine interface, now sporting a text input box that searches the contents of the archive. Vinay noted that the additional search functionality was limited to indexing homepages using limited metadata but that he hopes to improve the interface before publicly deploying it.

Near the end of Vinay's presentation, he asked Ian Milligan (actively tweeting the workshop per above) for a quick summary of the aforementioned Datathon. Ian spoke about the format and namely, the efforts to propagate the event into the future by registering the Archives Unleashed organization as an LLC.

Following the first presentation, Herbert Van de Sompel (@hvdsomp) presented the first full paper of the workshop, "Devising Affordable and Functional Linked Data Archives". Herbert first gave a primer on the Memento framework, then segued into the analysis he and his co-authors performed of the DBPedia archive. While the archival dumps were originally stored in MongoDB, which he admitted might not have been a good design decision, Herbert described the work done to make the archive Memento-compatible.

His group analyzed the availability, bandwidth, and cost of making the archive accessible as a simple data dump, via SPARQL endpoints, and via subject URIs, initially in terms of each method's expressiveness, potential Memento support, and ability to support cross-time data. Drawing on the recent research performed by his intern, Miel Vander Sande (@Miel_vds), he spoke of Linked Data Fragments (described by selectors, controls, and metadata), particularly Triple Pattern Fragments, and their tradeoffs compared to the other methods of representing the linked data. Using a binary RDF representation of the data in combination with Triple Pattern Fragments produced a result that costs less and facilitates consumption better than the previous storage methods. This led his team to produce a second version of the DBPedia archive with Memento support and with the advantages of LDF and the binary RDF representation in place.

After a short break, a series of 2-3 minute lightning talks began.

Yinlin Chen of Virginia Tech gave the first presentation with "A Library to Manage Web Archive Files in Cloud Storage". In this work, Yinlin spoke about integrating Islandora applications with the Fedora Commons digital repository and various cloud providers like Amazon S3 and Microsoft Azure.

Sunshin Lee (also of VT) presented second in the lightning talks with "Archiving and Analyzing Tweets and Webpages with the DLRL Hadoop Cluster". Sunshin described VT's Integrated Digital Event Archiving and Library (IDEAL) project and its use of a 20-node Hadoop cluster for collecting and archiving tweets and web pages for analysis, searching, and visualizing.

I (Mat Kelly - @machawk1) presented next with "InterPlanetary Wayback: The Permanent Web Archive". In this work, which Sawood (@ibnesayeed) and I initially prototyped at the Archives Unleashed Hackathon in March, I gave high-level details of integrating Web ARChive (WARC) files with the InterPlanetary File System for inherent de-duplication and dissemination of an archive's holdings. This work was also presented as a poster at JCDL 2016.

Sawood Alam presented his own work after me with "MemGator - A Portable Concurrent Memento Aggregator". His aggregator, written in the Go programming language, allows for concurrent querying of a user-specified set of archives and lets users deploy their own Memento aggregator. This work was also presented as a poster at JCDL 2016.

Bela Gipp (@BelaGipp) presented the final lightning talk with "Using the Blockchain of Cryptocurrencies for Timestamping Digital Cultural Heritage". In this work, he spoke of his group's efforts to timestamp data using the blockchain for information integrity. He has a live demo of his system available as well as an API for access.

Following Bela's talk, the session broke for a poster presentation. After the talks resumed, Richard Furuta (@furuta) presented "Evaluating Unexpected Change in a Distributed Collection". In this work his group examined the ACM Digital Library's collection of proceedings to identify the complexity and categories of change. From this analysis of sites on the live web, he ran a user study to classify the results into "Correct", "Kind of Correct", "University page", "Hello World", "Domain for Sale", "Error", and "Deceiving", with the cross-dimension of relevance (from Very Much to Not At All) as well as the percentage of the time the site was correctly identified.

Mark Phillips (@vphill) presented the final paper of the day, "Exploratory Analysis of the End of Term Web Archive: Comparing Two Collections". In this work his group examined end-of-(presidential)-term crawls from 2008 and 2012. After the organization that originally performed the 2008 crawl stated that it was not planning to crawl the government (.gov and .mil) sites for the 2012 election, his group planned to do so in its stead. Although the incumbent unexpectedly won the 2012 election, he performed the crawls anyway as a learning exercise for when a change of power does occur, as is guaranteed in the 2016 presidential election.

Mark performed an analysis of his 2008 and 2012 archives at UNT, which consist of CDX files covering over 160 million URIs in the 2008 collection and over 194 million URIs in the 2012 collection. The initial analysis tried to determine when the content was harvested, the goal being to crawl before and after the election as well as after the inauguration. He found distinctly dissimilar patterns in the two crawls and realized that the estimates of when the crawls occurred were far off, a result of the crawler only being able to archive the sites when it could get to them given a long queue of other crawls. He also performed a data type analysis, finding significant increases in PNGs and JPEGs and decreases in PDFs and GIFs, among other changes. The take-home from his analysis was that the selection of URIs ought to be driven more by partners and the community. His CDX analysis code is available online.
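
As a small sketch of the kind of data type tally described above (not Mark's code), the snippet below assumes the common 11-field CDX layout in which the MIME type is the fourth space-separated field; check the header line of your own CDX files before relying on that index.

```typescript
// Sketch: tally MIME types across a CDX index. Assumes the common
// " CDX N b a m s k r M S V g" layout where field 3 (0-based) is the MIME
// type; verify against the header line of the CDX files you actually have.
import { createReadStream } from "fs";
import { createInterface } from "readline";

async function mimeCounts(cdxPath: string): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  const lines = createInterface({ input: createReadStream(cdxPath) });
  for await (const line of lines) {
    if (line.startsWith(" CDX")) continue;              // skip the header line
    const mime = line.split(" ")[3] ?? "unknown";       // e.g., text/html, image/png
    counts.set(mime, (counts.get(mime) ?? 0) + 1);
  }
  return counts;
}

mimeCounts("eot2008.cdx").then((c) => console.log(c));
```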

WADL Day 2

The second of the two days of WADL began with Ed Fox reiterating that most users know little about the Internet beyond the web, and noting the role of social media in the realm of web archiving. He then introduced a panel of three speakers presenting in sequence.

Mark Phillips began by polling the room, asking if anyone else was using Twitter data in their work. His current work uses four data sets, with six more to be used in the near future. After Nelson Mandela died, he collected 10 million tweets but only made available the tweet IDs as well as the mapping between various identifiers and the embedded images. He emphasized the importance of documentation, namely a README describing the data sets.

Laura Wrubel (@liblaura) spoke next about the open source Social Feed Manager, a tool that allows users to create collections from social media platforms. She pointed out the need to provide documentation of how datasets are created for both researchers and archivists. With the recent release of version 1.0 of the software, she is hoping for feedback from collaborators.

Ian Milligan (@ianmilligan1) spoke next (slides available) about his open source strategy for documenting events. Based on a collection of more than 318,000 unique users who used the #elxn42 hashtag (for the 42nd Canadian election), he leveraged the Twitter API to fetch data, citing that time was of the essence since the data becomes mostly inaccessible after 7-9 days without "bags of money".

Using twarc, he was able to create his own archives, analyze the tweets using twarc-report and twarc-utilities, and visualize the data. Ian reintroduced the concept of tweet "hydration" and the difference between the legality and the ethics of storing and sharing the data, referencing the Black Twitter project out of USC. Per Twitter's TOS, the JSON cannot be stored. Stating the contrary: "We need to be very ethically conscious, but if we don't collect it, archives of powerful people will be lost and we'll only have the institutional archives."

Following the panel, Ed Fox suggested the potential of creating a research consortium for data sharing. An effort to build this is in the works.

After the panel, Zhiwu Xie (@zxie) presented the first paper of the day on "Nearline Web Archiving". Citing Masanes' 2006 book Web Archiving, Zhiwu stated that many other types of archiving now exist, some of which straddle the categories Masanes defined (client-side, server-side, and transactional). "The terminology is relatively loose and not mutually exclusive", Zhiwu said.

In his work, he enabled the Apache disk cache and investigated using it as a basis for server-driven web archiving. He reiterated that this model does not fit any of Masanes' three categories of web archiving and provided a new set of definitions:

  • Online: archiving is part of the web transaction and always adds to the server load, but it can only archive one response at a time
  • Offline: archiving is not part of the web transaction but a separate process; it can be used for batch archiving and handle many responses at a time
  • Nearline: depends on the accumulation of web transactions but, as a separate process, can be batched, though at a smaller granularity

Zhiwu spoke further about the tools Apache provides to clean the cache. His prototype modified the Apache module (in C) to preserve a copy of cached entries prior to their deletion from the system. "As far as I know, there is no strong C support for WARC files", Zhiwu said.

Brenda Reyes Ayala (@CamtheWicked) presented next with "We Need New Names: Applying Existing Models of Information Quality to Web Archives". In this work, Brenda stated that much of the work done by web archivists is too technical and that some thought should be given to web archiving concepts without reference to the technology. "How do you determine whether this archive is good enough?", she asked.

Brenda spoke of the notion of Information Quality (IQ), usually portrayed as a multi-dimensional construct with facets such as accuracy and validity. Citing Spaniol et al.'s work on data quality in web archiving, namely the notions of archival coherence and blurriness, and WS-DL's own Ainsworth and Nelson on temporal coherence, Brenda stated that there has been much discussion of data quality without full consideration of IQ. Defining this further, she said, "IQ is data at a very low level that has structure, change, etc. added to it for more information." She finished by asking whether types of coherence other than temporal coherence exist, e.g., topical coherence.

Mohamed Farag presented the last work of the workshop with "Which Webpage Should We Crawl First? Social Media-Based Webpage Source Importance Guidance". This work targeted focused crawling. Through experimentation on a tweet data set about the Brussels attack, he first extracted and sorted all URIs, identified seeds, and ran event-focused crawls to collect 1,000 web pages with consideration for URI uniqueness. His group used the harvest ratio to evaluate the crawler's output. From the experiment he noted biases in the data set, including domain biases, topical biases, genre biases (e.g., news, social media), political party biases, etc.

Closing

Following Mohamed's presentation, Richard Furuta and Ed Fox spoke about the future of the WADL workshop, stating that, as in recent years, there was an opportunity to collaborate again with IEEE-TCDL to increase the number of venues with which to associate the workshop. Ed suggested moving the workshop from being ad hoc year-to-year to having some sort of institutional stamp on it. "It seems we should have a more official connection", Ed said. "Directly associating with IEEE-TCDL will give more prestige to WADL."

Ed then spoke about the previous call for submissions for the IJDL special issue focused on web archiving and the state of the submissions for publication. He then had the participants count off, group together, discuss future efforts for the workshop, and present their breakout discussions to the other workshop participants.

Overall, WADL 2016 was a very productive, interesting, and enjoyable experience. The final take was that the workshop will continue to be active throughout the year beyond the yearly meeting and will hopefully further facilitate research in the area of web archiving and digital libraries.

—Mat (@machawk1)

EDIT: @liblaura provided a correction to this post following publication.