
Friday, April 15, 2016

2016-04-15: How I learned not to work full-time and get a PhD

ODU's commencement on May 7th marks the last day of my academic career as a student. I began my career at ODU in the Fall of 2004 and graduated with my BS in CS in the Spring of 2008, at which point I immediately began my Master's work under Dr. Levinstein. I completed my MS in Spring 2010, spent the summer with June Wright (now June Brunelle), and started my Ph.D. under Dr. Nelson in the Fall of 2010 (which is referred to as the Great Bait-and-Switch in our family). I will finish in the Spring of 2016, only to return as an adjunct instructor teaching CS418/518 at ODU in the Fall of 2016.


On February 5th, I defended my dissertation "Scripts in a Frame: A Framework for Archiving Deferred Representations" (above picture courtesy of Dr. Danette Allen, video courtesy of Mat Kelly). My research in the WS-DL group focused on understanding, measuring, and mitigating the impact of client-side technologies like JavaScript on web archives. In short, we showed that JavaScript causes embedded resources to go missing from mementos, leading to lower-quality mementos (as assessed by web users). We designed a framework that uses headless browsing in combination with archival crawling tools to mitigate the detrimental impact of JavaScript. This framework crawls more slowly but more thoroughly than Heritrix and results in higher-quality mementos. Further, if the framework interacts with the representations (e.g., clicking buttons, scrolling, mousing over elements), we add even more embedded resources to our crawl frontier, 92% of which are not archived.


Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations from Justin Brunelle

En route to these findings, we demonstrated the impact of JavaScript on mementos with our now-[in]famous CNN Presidential Debate example, defined the terms deferred representations to refer to representations dependent upon JavaScript to load embedded resources, descendants to refer to client-side states reached through the execution of client-side events, and published papers and articles on our findings (including Best Student Paper at DL2014 and Best Poster at JCDL2015).


At the end of a WS-DLer's academic tenure, it is customary to provide lessons learned, recommendations, and a recap of the academic experience that may be useful to future WS-DLers and grad students. Rather than recap the work that we have documented in published papers, I will echo some of my advice and lessons learned for what it takes to be a successful Ph.D. student.

Primarily, I learned that working while pursuing a Ph.D. is a bad idea. I worked at The MITRE Corporation throughout my doctoral studies. It took a massive amount of discipline, a massive amount of sacrifice (from myself, friends, and family), a forfeiture of any and all free time and sleep, and a near-lethal amount of coffee. Unless a student's "day job" aligns or overlaps significantly with her doctoral studies (I got close, but no cigar), I strongly recommend against doing this.

I learned that a robust support system (family, friends, advisor, etc.) is essential to being a successful graduate student. I am lucky that June is patient and tolerant of my late nights and irritability during paper season, my family supported my sacrifices and picked up the proverbial slack when I was at conferences or working late, and that Dr. Nelson dedicates an exceptional portion of his time to his students. (Did I say that just like you scripted, Dr. Nelson?) I learned to challenge myself and ignore the impostor syndrome.

I learned that a Ph.D. is life-consuming, demanding 110% of a student's attention, and hard: despite evidence to the contrary (i.e., they let me graduate), they don't give these things away. I also learned what real, capital-R "Research" involves, how to do it, and the impact that it has. This is a lesson that I am applying to my day job and current endeavors.

I learned to network. While I don't subscribe to the adage "It's not what you know, it's who you know", I will say that knowing people makes things much easier, more valuable, more impactful, and essential to success. However, if you don't know the "what", knowing the "who" is useless.

I learned that not all Ford muscle cars are Mustangs (even though they are clearly the best), that it's best to root for VT athletics (or at least pretend), that I am terrible at commas, and that giving your advisors homebrew with your in-review paper submissions certainly can't hurt; the best collaborations and brainstorming sessions often happen outside of the office and over a cup of coffee or a pint of beer.

Finally, I learned that finishing my Ph.D. before my son arrived was one of the best things I've done -- even if mostly by luck and divine intervention. I have thoroughly enjoyed redirecting the energy previously dedicated to staying up late, writing papers, and pounding my head against my keyboard toward spending time with June, Brayden, and my family.

Despite these hard lessons and a difficult ~5 years, pursuing a doctorate has been a great experience and well worth the hard work. I look forward to continued involvement with the WS-DL group, ODU, my dissertation committee, and sharing my many lessons learned with future students.


--Dr. Justin F. Brunelle

Thursday, November 5, 2015

2015-11-06: iPRES2015 Trip Report

From November 2nd through November 5th, Dr. Nelson, Dr. Weigle, and I attended the iPRES2015 conference at the University of North Carolina at Chapel Hill. This served as a return visit for Drs. Nelson and Weigle; Dr. Nelson worked at UNC through a NASA fellowship and Dr. Weigle received her PhD from UNC. We also met with Martin Klein, a WS-DL alumnus now at the UCLA Library. While the last ODU contingent to visit UNC was not so lucky, we returned to Norfolk relatively unscathed.

Cal Lee and Helen Tibbo opened the conference with a welcome on November 3rd, followed by Nancy McGovern's keynote address, delivered with Leo Konstantelos and Maureen Pennock. This was not a traditional keynote but an interactive dialogue in which several challenge areas were presented to the audience, and the audience responded -- live and on Twitter -- with significant achievements or advances in those challenge areas from #lastyear. For example, Dr. Nelson identified the #iCanHazMemento utility. The responses are available on Google Docs.


I attended the Institutional Opportunities and Challenges session to open the conference. Kresimir Duretec presented "Benchmarks for Digital Preservation Tools." His presentation touched on how we can get digital preservation tools that "Just Work", including benchmarks for evaluating tools on test beds and measuring them for quality. Related to this is Mat Kelly's work on the Archival Acid Test.



Alex Thirifays presented "Towards a Common Approach for Access to Digital Archival Records in Europe." This paper touched on user access: user needs, best practices for identifying requirements for access, and a capability gaps analysis of current tools versus user needs.

"Developing a Highly Automated Web Archive System Based
on IIPC Open Source Software" was presented by Zhenxin Wu. Her paper outlined a framework of open source tools to archive the web using Heritrix and a SOLR index of WARCS with an enhanced interface.

Barbara Sierman closed the session with her presentation "Best Until ... A National Infrastructure for Digital Preservation in the Netherlands" focusing on user accessibility and organizational challenges as part of a national strategy for preserving digital and cultural Dutch heritage.

After lunch, I led off the Infrastructure Opportunities and Challenges session with my paper on Archiving Deferred Representations Using a Two-Tiered Crawling Approach. We defined deferred representations as those that rely on JavaScript to load embedded resources on the client. We show that archives can use PhantomJS to build a crawl frontier 1.5 times larger than Heritrix's, but that PhantomJS crawls 10.5 times more slowly. We recommend using a classifier to recognize deferred representations and using PhantomJS only to crawl those, mitigating the crawl slow-down while still reaping the benefits of the headless crawler.
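As a rough illustration of the two-tiered idea, here is a minimal sketch (not the paper's actual crawler code); classifyRepresentation, crawlWithHeritrix, and crawlWithPhantomJS are hypothetical placeholders for the classifier and the two crawl paths:

```javascript
// Illustrative sketch of the two-tiered crawl, not the actual crawler code.
// classifyRepresentation(), crawlWithHeritrix(), and crawlWithPhantomJS()
// are hypothetical placeholders.
function crawlFrontier(frontier) {
  frontier.forEach(function (uri) {
    if (classifyRepresentation(uri) === 'deferred') {
      // Deferred representations need a headless browser to load
      // their Ajax-dependent embedded resources.
      crawlWithPhantomJS(uri);
    } else {
      // Everything else goes through the much faster traditional crawler.
      crawlWithHeritrix(uri);
    }
  });
}
```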

 
iPRES2015: Archiving Deferred Representations Using a Two-Tiered Crawling Approach from Justin Brunelle
  
Douglas Thain followed with his presentation on "Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?" Similar to our work with deferred representations, his work focuses on scientific replay of simulations and software experiments. He presents several tools as part of a framework for preserving the context of simulations and simulation software, including dependencies and build information.

Hao Xu presented "A Method for the Systematic Generation of Audit Logs in a Digital Preservation Environment and Its Experimental Implementation In a Production Ready System". His presentation focused on the construction of a finite state machine to determine whether a repository is following compliance policies for auditing purposes.

Jessica Trelogan and Lauren Jackson presented their paper "Preserving an Evolving Collection: 'On-The-Fly' Solutions for the Chora of Metaponto Publication Series." They discussed the storage of complex artifacts of ongoing research projects in archeology with the intent of improving the shareability of the collections.

To wrap up Day 1, we attended a panel on Preserving Born-Digital News consisting of Edward McCain, Hannah Sommers, Christie Moffatt, Abigail Potter (moderator), Stéphane Reecht, and Martin Klein. Christie Moffatt identified the challenges of archiving born-digital news material, including the challenge of scoping a corpus, and presented their case study on the Ebola response. Stéphane Reecht presented the BnF's work performing massive, once-a-year crawls as well as selective, targeted daily crawls. Hannah Sommers provided insight into how a news producer (NPR) approaches digital preservation. Martin Klein presented SoLoGlo (social, local, and global) news preservation, citing statistics about the preservation of links shortened by the LA Times. Finally, Edward McCain discussed the ephemeral nature of born-digital news media and provided examples of the sparse number of mementos of news pages in the Wayback Machine.


To kick off Day 2, Lisa Nakamura gave her opening keynote The Digital Afterlives of This Bridge Called My Back: Public Feminism and Open Access. Her talk focused on the role of Tumblr in curating and sharing a book no longer in print as a way to open the dialogue on the role of piracy and curation in the "wild" to support open access and preservation.

I attended the Dimensions of Digital Preservation session, which began with Liz Lyon's presentation on "Applying Translational Principles to Data Science Curriculum Development." Her paper outlines a study to help revise the University of Pittsburgh's data science curriculum. Nora Mattern took over the presentation to discuss the expectations of the job market to identify the skills required to be a professional data scientist.

Elizabeth Yakel presented "Educational Records of Practice: Preservation and Access Concerns." Her presentation outlined the unique challenges with preserving, curating, and making available educational data. Education researchers or educators can use these resources to further their education, reuse materials, and teach the next generation of teachers.

Emily Maemura presented "A Survey of Organizational Assessment Frameworks in Digital Preservation." She presented the results of a survey of organizational assessment frameworks, which serve a purpose for digital preservation similar to the one software maturity models serve for computer scientists. Further, her paper identifies trends, gaps, and models for assessment.

Matt Schultz, Katherine Skinner, and Aaron Trehub presented "Getting to the Bottom Line: 20 Digital Preservation Cost Questions." Their questions help institutions evaluate cost, including questions about storage fees, support, business plans, etc. to help institutions assess their approach to taking on digital preservation.

After lunch, I attended the panel on Long Term Preservation Strategies & Architecture: Views from Implementers, consisting of Mary Molinaro (moderator), Katherine Skinner, Sibyl Schaefer, Dave Pcolar, and Sam Meister. Sibyl Schaefer led off with a presentation detailing Chronopolis and the ACE audit manager. Dave Pcolar followed by presenting the Digital Preservation Network (DPN) and its data replication policies for dark archives. Sam Meister discussed the BitCurator Consortium, which helps with the acquisition, appraisal, arrangement and description, and access of archived material. Finally, Katherine Skinner presented the MetaArchive Cooperative and its activities teaching institutions to perform their own archiving, along with other statistics (e.g., the minimum number of copies to keep stuff safe is 5).

Day 2 concluded with the poster session (including a poster by Martin Klein) and reception.



Pam Samuelson opened Day 3 with her keynote Mass Digitization of Cultural Heritage: Can Copyright Obstacles Be Overcome? Her keynote touched on the challenges with preserving cultural heritage introduced by copyright, along with some of the emerging techniques to overcome the challenges. She identified duration of copyright as a major contributor to the challenges of cultural preservation. She notes that most countries have exceptions for libraries and archives for preservation purposes, and explains recent U.S. evolutions in fair use through the Google Books rulings.

After Samuelson's keynote, I concluded my iPRES2015 visit and explored Chapel Hill, including a visit to the Old Well (at the top of this post) and an impromptu demo of the pit simulation. It was very scary.



Several themes emerged from iPRES2015, including an increased emphasis on web archiving and a need for improved context, provenance, and access for digitally preserved resources. I look forward to monitoring the progress in these areas.


--Justin F. Brunelle

Friday, June 26, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.
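As a hypothetical illustration (not code from any particular site), the snippet below assembles its request URL at runtime; a crawler that only scans the script text can, at best, find the '/api/models' fragment, never the full dereferenceable URI. loadModels and populateModelMenu are made-up names for illustration.

```javascript
// Hypothetical client-side code. The full URI only exists after runtime
// string concatenation, so a crawler that merely "peeks" at the script
// text never sees the dereferenceable URL.
function loadModels(make) {
  var url = '/api/models?make=' + encodeURIComponent(make) +
            '&ts=' + Date.now();
  var xhr = new XMLHttpRequest();
  xhr.onload = function () {
    // populateModelMenu() is a hypothetical helper that fills the drop-down menu
    populateModelMenu(JSON.parse(xhr.responseText));
  };
  xhr.open('GET', url);
  xhr.send();
}
```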

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax making a request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher-fidelity archives a page at a time; this is an example implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), and matches our goal of replaying interactions; WebRecorder.io is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript. As such, it provides a tight integration between the loaded page's DOM and the controlling code. This allows code to be injected directly into the target page and native DOM interactions to be performed. As a result, PhantomJS provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.
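A minimal PhantomJS sketch of this kind of DOM exploration is below; the target URI is only illustrative, and the scan finds only inline onclick handlers (handlers registered via addEventListener stay invisible, which is exactly the kind of gap described above).

```javascript
// Minimal PhantomJS sketch: enumerate elements with inline onclick handlers.
// The URI is illustrative; handlers registered with addEventListener are
// not visible to this simple scan, which is the limitation discussed above.
var page = require('webpage').create();

page.open('http://www.kbb.com/', function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  } else {
    var clickable = page.evaluate(function () {
      // Runs inside the page's context; returns a serializable array.
      var found = [];
      var all = document.getElementsByTagName('*');
      for (var i = 0; i < all.length; i++) {
        if (all[i].onclick || all[i].getAttribute('onclick')) {
          found.push(all[i].tagName + (all[i].id ? '#' + all[i].id : ''));
        }
      }
      return found;
    });

    console.log(JSON.stringify(clickable));
    phantom.exit();
  }
});
```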

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are several language bindings (Java, Python, Perl, etc.) that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automating actions such as mouse movements.

Based on our experimentation, Selenium is a better tool for canned interactions, such as a pre-scripted set of clicks, drags, etc.; a sketch of such an interaction follows the table. A summary of the differences between PhantomJS, Selenium, and VisualEvent (explored later in this post) is presented in the table below. Note that our speed testing is based on brief observation and should be used as a relative comparison rather than a definitive measurement.

Tool                 | PhantomJS                   | Selenium      | VisualEvent
Operation            | Headless                    | Full-browser  | JavaScript bookmarklet and code
Speed (seconds)      | 2.5-8                       | 4-10          | < 1 (on user click)
DOM Integration      | Close integration           | 3rd party     | Close integration/embedded
DOM Event Extraction | Semi-reliable               | Semi-reliable | 100% reliable
DOM Interaction      | Scripted, native, on-demand | Scripted      | None
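The sketch below shows the kind of canned interaction Selenium handles well, written against Selenium's JavaScript bindings (selenium-webdriver); the target page and the CSS selector are assumptions for illustration, and the same pre-scripted sequence could be expressed with the Java, Python, or Perl APIs mentioned above.

```javascript
// A minimal pre-scripted ("canned") interaction using Selenium's JavaScript
// bindings (selenium-webdriver). The page and the 'select.make' selector are
// assumptions for illustration; a real crawl would supply its own locators.
const { Builder, By, until } = require('selenium-webdriver');

(async function cannedInteraction() {
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('http://www.kbb.com/');
    // Wait for the (hypothetical) "Make" menu to appear, then click it.
    const make = await driver.wait(until.elementLocated(By.css('select.make')), 10000);
    await make.click();
    await driver.sleep(2000);  // crude wait for the Ajax response to populate "Model"
  } finally {
    await driver.quit();
  }
})();
```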

To summarize, PhantomJS is faster than Selenium (because it is headless while Selenium loads a full browser) and is more closely coupled with the browser, the DOM, and client-side events. However, by using a native browser, Selenium defers to the browser the responsibility of keeping up with advances in web technologies such as JavaScript, rather than maintaining that responsibility within the archival tool. This will prove beneficial as JavaScript, HTML5, and other client-side technologies continue to evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded and pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's Archival Acid Test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i (the user interaction test) but pass all of the others. This indicates that both tools have difficulty identifying all of the events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop-down menus).
Fig 6. The Acid Test results are identical for PhantomJS and Selenium; both fail the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of the DOM Event Extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our initial approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover their attached event handlers. VisualEvent instead starts with the JavaScript: it gathers all of the JavaScript functions, determines which DOM elements reference those functions, and determines whether those functions are event handlers. VisualEvent then visually displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) through an overlay in the browser. We removed the visual aspects and leverage its JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive element are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements, and then use PhantomJS to interact with those elements. This discovers states on the client that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.
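A rough sketch of that injection step follows; the file name visualevent-modified.js and the window.__interactiveElements global it is assumed to populate are illustrative stand-ins for our stripped-down copy of VisualEvent, not its real API.

```javascript
// Rough sketch of the injection step. 'visualevent-modified.js' and the
// window.__interactiveElements global it is assumed to populate are
// illustrative stand-ins for our stripped-down copy of VisualEvent.
var page = require('webpage').create();

page.open('http://www.kbb.com/', function (status) {
  if (status !== 'success') {
    phantom.exit(1);
  } else {
    // Evaluate our modified VisualEvent code inside the page's context.
    page.injectJs('visualevent-modified.js');

    var interactive = page.evaluate(function () {
      // Assumed shape: the injected script records each element's tag and
      // the names of the events it handles.
      return window.__interactiveElements || [];
    });

    console.log(JSON.stringify(interactive, null, 2));

    // Interactions (clicks, etc.) could then be dispatched with
    // page.sendEvent() or further page.evaluate() calls.
    phantom.exit();
  }
});
```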

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this recommendation comes from our personal experience with our use case (automatically identifying a set of DOM interactions); other experimental conditions and goals may be better served by Selenium or other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle