
Friday, June 26, 2015

2015-06-26: PhantomJS+VisualEvent or Selenium for Web Archiving?

My research and niche within the WS-DL research group focuses on understanding how the adoption of JavaScript and Ajax is impacting our archives. I leave the details as an exercise to the reader (D-Lib Magazine 2013, TPDL2013, JCDL2014, IJDL2015), but the proverbial bumper sticker is that JavaScript makes archiving more difficult because the traditional archival tools are not equipped to execute JavaScript.

For example, Heritrix (the Internet Archive's automatic archival crawler) executes HTTP GET requests for archival target URIs on its frontier and archives the HTTP response headers and the content returned from the server when the URI is dereferenced. Heritrix "peeks" into embedded JavaScript and extracts any URIs it can discover, but does not execute any client-side scripts. As such, Heritrix will miss any URIs constructed in the JavaScript or any embedded resources loaded via Ajax.

For example, the Kelly Blue Book Car Values website (Figure 1) uses Ajax to retrieve the data to populate the "Model" and "Year" drop down menus when the user selects an option from the "Make" menu (Figures 2-3).
Fig 1. KBB.com uses Ajax to retrieve data for the drop down menus.
Fig 2. The user selects the Make option, which initiates an Ajax request...
Fig 3. ... and the Model and Year data from the Ajax response is used in their respective drop down menus.
Using Chrome's Developer Tools, we can see the Ajax request for this information (Figure 4).

Fig 4. Ajax is used to retrieve additional data from the server and change the state of the client.
If we view a memento of KBB.com (Figure 5), we see that the drop downs are not operational because Heritrix was not able to run the JavaScript and capture the data needed to populate the drop downs.

Fig 5. The memento of KBB.com is not completely functional due to the reliance on Ajax to load extra-client data after the initial page load.
The overly-simplified solution to this problem is for archives to use a tool that executes JavaScript in ways the traditional archival crawlers cannot. (Our paper discussing the performance trade-offs and impact of using headless browsing vs. traditional crawling tools has been accepted for publication at iPres2015.) More specifically, the crawlers should make use of technologies that act more like (or load resources in actual) browsers. For example, Archive-It is using Umbra to overcome the difficulties introduced by JavaScript for a subset of domains.

We are interested in a similar approach and have been investigating headless browsing tools and client-side automation utilities. Specifically, Selenium (a client-side automation tool), PhantomJS (a headless browsing client), and a non-archival project called VisualEvent have piqued our interest as most useful to our approach.

There are other similar tools (Browsertrix, WebRecorder.io, CrawlJAX), but these are slightly outside the scope of what we want to do. We are currently performing research that requires a tool to automatically identify interactive elements of a page, map the elements to a client-side state, and recognize and execute user interactions on the page to move between client-side states. Browsertrix uses Selenium to record HTTP traffic to create higher-fidelity archives a page at a time; this is an example of an implementation of Selenium, but it does not match our goal of running automatically. WebRecorder.io can record user interactions and replay them with high fidelity (including the resulting changes to the representation), and matches our goal of replaying interactions; WebRecorder.io is another appropriate use case for Selenium, but it does not match our goal of automatically recognizing and interacting with interactive DOM elements. CrawlJAX is an automatic Ajax test suite that constructs state diagrams of deferred representations; however, CrawlJAX is designed for testing rather than archiving.

In this blog post, I will discuss some of our initial findings with detecting and interacting with DOM elements and the trade-offs we have observed between the tools we have investigated.

PhantomJS is a headless browsing utility that is scripted in JavaScript, which provides a tight integration between the loaded page, its DOM, and the controlling code. This allows code to be injected directly into the target page and native DOM interactions to be performed. As such, PhantomJS provides a better mechanism for identifying specific DOM elements and their properties.

For example, PhantomJS can be used to explore the DOM for all available buttons or button click events. In the KBB.com example, PhantomJS can discover the onclick events attached to the KBB menus. However, without external libraries, PhantomJS has a difficult time recognizing the onchange event attached to the drop downs.
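To make this concrete, the sketch below shows roughly how a PhantomJS script can walk the DOM and report elements that have onclick or onchange handler properties. The KBB.com URL is just the example from above, and note that handlers attached via addEventListener (or by a JavaScript framework) will not be found this way, which is exactly the limitation described above.

    var page = require('webpage').create();

    page.open('http://www.kbb.com/', function (status) {
      if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
        return;
      }

      // page.evaluate() runs in the context of the loaded page.
      var handlers = page.evaluate(function () {
        var found = [];
        var elements = document.getElementsByTagName('*');
        for (var i = 0; i < elements.length; i++) {
          // Only handler *properties* are visible here; listeners registered
          // with addEventListener (or by a framework) are not discoverable
          // this way.
          if (elements[i].onclick || elements[i].onchange) {
            found.push({
              tag: elements[i].tagName,
              id: elements[i].id,
              onclick: !!elements[i].onclick,
              onchange: !!elements[i].onchange
            });
          }
        }
        return found;
      });

      console.log(JSON.stringify(handlers, null, 2));
      phantom.exit();
    });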

Selenium is not a headless tool -- we have used the tongue-in-cheek phrase "headful" to describe it -- as it loads an entire browser to perform client-side automation. There are APIs for several languages, including Java, Python, and Perl, that can be used to interact with the page. Because Selenium is headful, it does not provide as close an integration between the DOM and the script as PhantomJS does. However, it provides better utilities for automated action through mouse movements.
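As a rough illustration of a pre-scripted ("canned") interaction, the sketch below drives the KBB.com drop downs using Selenium's JavaScript bindings (chosen only for consistency with the other sketches in this post; the Java, Python, and Perl APIs are analogous). The element ids are hypothetical placeholders, not the ids actually used on the live page.

    var webdriver = require('selenium-webdriver');
    var By = webdriver.By;
    var until = webdriver.until;

    // Launch a full ("headful") browser.
    var driver = new webdriver.Builder().forBrowser('firefox').build();

    // The ids 'make-menu' and 'model-menu' below are hypothetical.
    driver.get('http://www.kbb.com/')
      .then(function () {
        return driver.findElement(By.id('make-menu')).sendKeys('Honda');
      })
      .then(function () {
        // Wait for the Ajax response to populate the "Model" menu.
        return driver.wait(until.elementLocated(By.id('model-menu')), 10000);
      })
      .then(function () {
        return driver.findElement(By.id('model-menu')).sendKeys('Accord');
      })
      .then(function () {
        return driver.quit();
      });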

Based on our experimentation, Selenium is a better tool for canned interaction (e.g., a pre-scripted set of clicks, drags, etc.). A summary of the differences between PhantomJS, Selenium, and VisualEvent (to be explored later in this post) is presented in the table below. Note that our speed figures are based on brief observation and should be used as a relative comparison rather than a definitive measurement.

Tool                 | PhantomJS                   | Selenium      | VisualEvent
Operation            | Headless                    | Full-Browser  | JavaScript bookmarklet and code
Speed (seconds)      | 2.5-8                       | 4-10          | < 1 (on user click)
DOM Integration      | Close integration           | 3rd party     | Close integration/embedded
DOM Event Extraction | Semi-reliable               | Semi-reliable | 100% reliable
DOM Interaction      | Scripted, native, on-demand | Scripted      | None

To summarize, PhantomJS is faster (because it is headless) and more closely coupled with the browser, the DOM, and client-side events than Selenium (which loads a full, third-party browser). However, by using a native browser, Selenium defers the responsibility of keeping up with advances in web technologies such as JavaScript to the browser rather than maintaining that responsibility within the archival tool. This will prove to be beneficial as JavaScript, HTML5, and other client-side technologies evolve and emerge.

Sources online (e.g., Stack Overflow, Real Python, Vilimblog) have recommended using Selenium and PhantomJS in tandem to leverage the benefits of both, but this is too heavy-handed an approach for a web-scale crawl. Instead, we recommend that canned interactions or recorded, pre-scripted events be performed using Selenium and that adaptive or extracted events be performed in PhantomJS.

To confirm this, we tested Selenium and PhantomJS on Mat Kelly's archival acid test (shown in Figure 6). Without a canned, scripted interaction based on a priori knowledge of the test, both PhantomJS and Selenium fail Test 2i (the user interaction test) but pass all others. This indicates that both Selenium and PhantomJS have difficulty identifying all events attached to all DOM elements (e.g., neither can easily detect the onchange event attached to the KBB.com drop downs).
Fig 6. The acid test results are identical for PhantomJS and Selenium; both fail the post-load interaction test.
VisualEvent is advertised as a bookmarklet-run solution for identifying client-side events, not an archival utility, but can reliably identify all of the event handlers attached to DOM elements. To improve the accuracy of the DOM Event Extraction, we have been using VisualEvent to discover the event handlers on the DOM.

VisualEvent takes the reverse approach to discovering the event handlers attached to DOM elements. Our approach -- which was ineffective -- was to use JavaScript to iterate through all DOM elements and try to discover their attached event handlers. VisualEvent instead starts with the JavaScript: it gathers all of the JavaScript functions on the page, determines which DOM elements reference those functions, and determines whether those functions are event handlers. VisualEvent then displays the interactive elements of the DOM (Figure 7) and their associated event handler functions (Figure 8) visually, through an overlay in the browser. We removed the visual aspects and leverage the JavaScript functions to extract the interactive elements of the page.

Fig 7. VisualEvent adds a DIV overlay to identify the interactive elements of the DOM.

Fig 8. The event handlers of each interactive element are pulled from the JavaScript and displayed on the page, as well.

We use PhantomJS to inject the VisualEvent code into a page, extract the interactive elements, and then use PhantomJS to interact with those elements. This discovers client-side states that traditional crawlers like Heritrix cannot capture. Using this approach, PhantomJS can capture all interactive elements on the page, including the onchange events attached to the drop-down menus on KBB.com.
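A minimal sketch of that injection step is shown below. The local file name visualEvent.js and the window.VisualEventResults hook are assumptions: the stock VisualEvent bookmarklet renders its findings as an overlay, so our modified, overlay-free version would need to expose them programmatically, as described above.

    var page = require('webpage').create();

    page.open('http://www.kbb.com/', function (status) {
      if (status !== 'success') {
        console.log('Failed to load page');
        phantom.exit(1);
        return;
      }

      // Inject a local copy of the (modified) VisualEvent code into the page.
      // 'visualEvent.js' is an assumed file name.
      if (!page.injectJs('visualEvent.js')) {
        console.log('Could not inject VisualEvent');
        phantom.exit(1);
        return;
      }

      // Ask the injected code for the elements it identified as interactive.
      // window.VisualEventResults is a hypothetical hook exposed by our
      // modified version of VisualEvent.
      var interactive = page.evaluate(function () {
        return window.VisualEventResults;
      });

      console.log(JSON.stringify(interactive, null, 2));
      phantom.exit();
    });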

So far, this approach provides the fastest, most accurate ad hoc set of DOM interactions. However, this is a recommendation from our personal experience for our use case (automatically identifying a set of DOM interactions); other experimental conditions and goals may be better served by Selenium or other client-side tools.

Note that this set of recommendations is based on empirical evidence and personal experience. It is not meant as a thorough evaluation of each tool, but we hope that our experiences are beneficial to others.

--Justin F. Brunelle

Sunday, August 28, 2011

2011-08-28: KDD 2011 Trip Report

Author: Carlton Northern

The SIGKDD 2011 conference took place August 21-24 at the Hyatt Manchester in San Diego, CA. Researchers from all over the world interested in knowledge discovery and data mining were in attendance. This conference in particular has a heavy statistical-analysis flavor, and many presentations were math intensive.

I was invited to present my master's project research at the Mining Data Semantics (MDS2011) Workshop of KDD. In this paper, we present an approach to finding the social media profiles of people from an organization. This is possible due to the links created between members of an organization. For instance, co-workers or students will likely friend each other, creating hyperlinks between their respective accounts. These links, if public, can be mined and used to disambiguate other profiles that may share the same names as the individuals we are searching for. The following figure shows the number of profiles found from the ODU Computer Science student body for each respective social media site and the links found between them.


This picture represents the actual students themselves and the links between them.  Black nodes are undergrads, green nodes are grads, and red nodes are members of the WS-DL research group.



These are the slides:

Here is the paper:

I've synopsized some of the interesting presentations from the conference:


Stephen Boyd - Stanford University "From Embedded Real-Time to Large-Scale Distributed". Stephen Boyd's talk focused on his current research area of convex optimization. He explained that convex optimization is a mathematical technique in which many complex problems of model fitting, resource allocation, engineering design, etc. can be transformed into a simple convex optimization problem, solved, and then transformed back into the original problem to get the solution. He went on to explain how this can be applied in settings ranging from real-time embedded systems, such as a hard disk drive head-seek problem, to large distributed systems, such as California's power grid.


Amol Ghoting - IBM "NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce". With Hadoop, you write a map function and a reduce function, where anything can be mapped to a (key, value) pair. The problem with Hadoop is that it has a two-stage data flow, which can be cumbersome to program against. Also, job scheduling and data management are handled by the user. Lastly, code reuse and portability are diminished. This toolkit tries to make the key features of Hadoop available to developers but without a Hadoop-specific implementation. NIMBLE decouples algorithm computation from data management, parallel communication, and control. It does this by using a series of basic datasets and basic tasks that form a DAG, and tasks can spawn other tasks. With this structure in place, simultaneous data and task parallelism is achievable.

David Haussler - UC Santa Cruz "Cancer Genomics". The cost of DNA sequencing has dropped dramatically; it was following Moore's law but is now dropping 10-fold every two years, so entire genomes can now be sequenced cheaply. The Cancer Genome Atlas has been created, and 10,000 tumors will be sequenced in the next two years using it. Cancer genome sequencing will soon be a standard clinical practice. Because each person's DNA is different, and each tumor resulting from a person's DNA is different, a huge computational processing problem looms in the near future.


Ahmed Metwally - Google "Estimating the number of people behind an IP Address". Most research assumes that there is one person using one IP address, but this is not the case. The number of users behind an IP address also changes over time; for instance, a hotel hosting a conference will have many more users sharing the same IP address than usual. So how would one estimate the number of these users in a non-intrusive way? One method is to look at trusted cookie counts. Another is to look at diverse traffic. Google caps traffic volume per IP to stop people from gaming the system using the same IP address. Google knows how many users share an IP address when they are logged in to Google's sites with a username and password, but some of Google's traffic comes from users who don't have a Google account. This research is for those who want to filter users without asking them for any identification, thus preserving their privacy. This method is currently being used at Google for detecting click fraud.


D. Sculley - Google "Detecting Adversarial Advertisements in the Wild". An adversarial advertiser is an advertiser that uses Google AdWords or AdSense to advertise misleading products like counterfeit goods or scams. Most ads are good; only a small number are bad. Google uses in-house trained people to hand-build rule-based models. Allowing these people to hand-build the rules gave them a great incentive and improved morale, rather than just having them do repetitive tasks over and over again. Automated methods are being used as well, but this part of the presentation went right over my head.


Chunyu Luo - University of Tennessee "Enhanced Investment Decisions in P2P Lending: An Investor Composition Perspective". In this paper, they try to decide which loans are worth investing in; in other words, what makes a good loan? They use a bipartite investment network with investors on one side, investees on the other, and loans as the edges between them. Each loan can be considered a composition of many investors. The idea is that by looking at the past performance of the other investors in a given loan, you can improve your prediction of the return rate for that loan. They ran an experiment on a dataset from prosper.com, and the composition method far outperformed the average return on investment.


Susan Imberman - College of Staten Island "From Market Baskets to Mole Rats: Using Data Mining Techniques to Analyze RFID Data Describing Laboratory Animal Behavior". This paper presents the data mining techniques used to analyze RFID data from a colony of mole rats. Much as we use RFID in cars for tolls like EZ Pass, they tag the mole rats with RFID and collect a sample each time an animal passes specific points in the colony (a series of pipes and rooms). They used k-means clustering, which showed animal place preference, and an adjacency matrix to get an idea of which mole rats liked to be near one another. This produced three distinct subgraphs that corresponded well to the colony structure of mole rats: queens, large workers, and small workers. They then applied market-basket-style analysis, treating repeated mole rat behaviors like items in a shopping basket.

After the conference ended on Wednesday, Hurricane Irene was on track for a direct hit to Hampton Roads.  My flight was scheduled to arrive in Norfolk Friday night which was cutting it very close to the storm hitting on Saturday.  So I decided to extend the trip till Monday and ride out the storm here in sunny San Diego.  In total, I managed to miss a hurricane, a tornado, an earthquake, and a swamp fire.  I think I made a good decision...