
Friday, June 9, 2017

2017-06-09: InfoVis Spring 2016 Class Projects

I'm way behind in posting about my Spring 2016 offering of CS 725/825 Information Visualization, but better late than never. (Previous semester highlights posts: Spring 2015, Spring/Fall 2013, Fall 2012, Fall 2011)
 
Here are a few projects that I'd like to highlight. (All class projects are listed in my InfoVis Gallery.)

Expanding the WorldVis Simulation
Created by Juliette Pardue, Mridul Sen, Christos Tsolakis


This project (available at http://ws-dl.cs.odu.edu/vis/world-vis/) was an extension of the FluNet visualization, developed as a class project in 2013. The students generalized the specialized tool to handle any dataset of quantitative attributes per country over time and added attributes based on continent averages. They also computed summary data for each dataset for each year, so at a glance the user can see statistics such as the countries with the minimum and maximum values.
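
The actual tool is web-based (D3.js); purely as an illustration of the kind of per-year summary described above, here is a minimal Python/pandas sketch with hypothetical data and column names:

```python
import pandas as pd

# Hypothetical long-format dataset: one row per (country, year, value)
df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year":    [2000, 2000, 2000, 2001, 2001, 2001],
    "value":   [3.2, 7.5, 1.1, 4.0, 6.8, 0.9],
})

# For each year, find the country with the minimum and maximum value
def summarize(group):
    return pd.Series({
        "min_country": group.loc[group["value"].idxmin(), "country"],
        "min_value":   group["value"].min(),
        "max_country": group.loc[group["value"].idxmax(), "country"],
        "max_value":   group["value"].max(),
    })

print(df.groupby("year").apply(summarize))
```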

This work was accepted as a poster to IEEE VIS 2016:
Juliette Pardue, Mridul Sen, Christos Tsolakis, Reid Rankin, Ayush Khandelwal and Michele C. Weigle, "WorldVis: A Visualization Tool for World Data," In Proceedings of IEEE VIS. Baltimore, MD, October 2016, poster abstract. (PDF, poster, trip report blog post)



Visualization for Navy Hearing Conservation Program (HCP)
Created by Erika Siregar (@erikaris), Hung Do (@hdo003), Srinivas Havanur


This project (available at http://www.cs.odu.edu/~hdo/InfoVis/navy/final-project/index.html) was also an extension of previous work. The first version of this visualization was built by Lulwah Alkwai.

The aim of this work is to track the hearing levels of US Navy workers over time through the Hearing Conservation Program (HCP). The HCP's goal is to detect and prevent noise-induced hearing loss among service members by analyzing their hearing levels over the years. The students analyzed the audiogram dataset and used D3.js to produce interactive visualizations showing workers' hearing curves over the years.



ODU Student Demographics
Created by Ravi Majeti, Rajyalakshmi Mukkamala, Shivani Bimavarapu

This project (available at http://webspace.cs.odu.edu/~nmajeti/InfoViz/World/worldmap-template.html) focuses on ODU international student data. It visualizes the headcount of international graduate and undergraduate students at ODU by country for a selected major and year, and it visualizes the gender ratio of undergraduate and graduate students for each year. The main goal is to provide prospective students with an interactive interface for exploring the global diversity at ODU and deciding whether ODU meets their expectations regarding alumni from their major and country.



Visualizing Web Archives of Moderate Size
Created by John Berlin (@johnaberlin), Joel Rodriguez-Ortiz, Dan Milanko


This work (available at http://jrodgz.github.io/project/) develops a platform for understanding web archives in a multi-user setting. The students used contextual data provided during the archival process to offer a new approach to identifying the general state of the archives. This metadata allows us to identify the most common domains, archived resources, times, and tags associated with a web collection. The tool outlines the most important areas of focus in web archives and gives users a clearer picture of what their collections comprise, in both specific and general terms.




-Michele

Tuesday, March 7, 2017

2017-03-07: Archives Unleashed 3.0: Web Archive Datathon Trip Report

Archives Unleashed 3.0 took place at the Internet Archive in San Francisco, CA. The workshop was two days long, February 23-24, 2017, and was held in conjunction with the National Web Symposium, also hosted at the Internet Archive. Four members of the Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend: Sawood Alam, Mohamed Aturban, Erika Siregar, and myself. This event was the third in the series, following the Archives Unleashed Web Archive Hackathon 1.0 and Web Archive Hackathon 2.0.

This workshop was supported by the Internet Archive, Rutgers University, and the University of Waterloo. It brought together a small group of around 20 researchers who worked together to develop new open source tools for web archives. The three organizers were Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (Assistant Professor, Department of History, University of Waterloo), and Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo).
It was a big moment for me when I first saw the Internet Archive building, with an Internet Archive truck parked outside. Since 2009, the IA headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science church. Inside the building, the main hall holds mini statues of archivists who have worked at the IA for over three years.
On Wednesday night, we had a welcome dinner and brief introductions from the members who had already arrived.
Day 1 (February 23, 2017)
On Thursday, we started with breakfast and headed to the main hall for several presentations. Matthew Weber presented “Stating the Problem, Logistical Comments”. Dr. Weber started by stating the goals, which included developing a common vision of web archiving and tool development and learning to work with born-digital resources for humanities and social science research.
Next, Ian Milligan presented “History Crawling with Warcbase”. Dr. Milligan gave an overview of Warcbase. Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The tool is used to analyze web archives using Spark, and to take advantage of HBase to provide random access as well as analytics capabilities.

Next, Jefferson Bailey (Internet Archive) presented “Data Mining Web Archives”. He talked about conceptual issues in access to web archives, which include: provenance (much data, but not all as expected), acquisition (highly technical; crawl configs; soft 404s), border issues (the web never really ends), the lure of evermore data (more data is not better data), and attestation (higher sensitivity to elision than in traditional archives?). He also explained the different formats in which the Internet Archive can provide its data, which include CDX, the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). In addition, he presented an overview of some research projects based on collaboration with the IA. Some of the projects he mentioned were the ALEXANDRIA project, Web Archives for Longitudinal Knowledge, Global Event and Trend Archive Research & Integrated Digital Event Archiving, and many more.
Next, Vinay Goel (Internet Archive) presented “API Overview”. He presented the Beta Wayback Machine, which searches the IA based on a URL or a word related to a site's home page. He mentioned that search results are ranked based on anchor text.
Justin Littman (George Washington University Libraries) presented “Social Media Collecting with Social Feed Manager”. SFM is open source software that collects social media from the APIs of Twitter, Tumblr, Flickr, and Sina Weibo.

The final talk was by Ilya Kreymer (Rhizome), who presented an overview of the tool Webrecorder. The tool provides an integrated platform for creating high-fidelity web archives while browsing, sharing, and disseminating archived content.
After that, we had a short coffee break and started to form three groups. To form the groups, all participants were encouraged to write a few words on the topic they would like to work on; some words that appeared were fake news, news, twitter, etc. Similar notes were grouped together along with their associated members. The resulting groups were Local News, Fake News, and End of Term Transition.

  • Local News (Good News/Bad News) - Sawood Alam (Old Dominion University), Lulwah Alkwai (Old Dominion University), Mark Beasley (Rhizome), Brenda Berkelaar (University of Texas at Austin), Frances Corry (University of Southern California), Ilya Kreymer (Rhizome), Nathalie Casemajor (INRS), Lauren Ko (University of North Texas)
  • Fake News - Erika Siregar (Old Dominion University), Allison Hegel (University of California, Los Angeles), Liuqing Li (Virginia Tech), Dallas Pillen (University of Michigan), Melanie Walsh (Washington University)
  • End of Term Transition - Mohamed Aturban (Old Dominion University), Justin Littman (George Washington University), Jessica Ogden (University of Southampton), Yu Xu (University of Southern California), Shawn Walker (University of Washington)
Every group started to work on its dataset, brainstormed different research questions to answer, and formed a plan of work. Then we basically worked through the rest of the day and ended the night with a working dinner at the IA.

Day 2 (February 24, 2017)
On Friday we started by eating breakfast, and then each team continued to work on their projects.
Every Friday the IA hosts a free lunch where hundreds of people come together, including artists, activists, engineers, librarians, and many more. Afterwards, a public tour of the IA takes place.
We had some short talks after lunch. The first talk was by Justin Littman, where he presented an overview of his new tool, Fbarc. The tool archives webpages from Facebook using the Graph API.
Nick Ruest (Digital Assets Librarian at York University), gave a talk on “Twitter”. Next, Shawn Walker (University of Washington), presented “We are doing it wrong!”. He explained how the current collecting process of social media is not how people view social media.

After that, all the teams presented their projects, starting with our team. We called our project "Good News/Bad News". We utilized historical captures (mementos) of various local news sites' homepages from Archive-It to prepare our seed dataset. To transform the data for our usage, we utilized Webrecorder, the WAT converter, and some custom scripts. We extracted the headlines featured on each site's homepage for each day. With the extracted headlines, we analyzed sentiment on various levels, including individual headlines, individual sites, and the whole nation, using the VADER-Sentiment-Analysis Python library. To leverage more machine learning capabilities for clustering and classification, we built a Latent Semantic Indexing (LSI) model using a Ruby library called Classifier Reborn. Our LSI model helped us convey the overlap of discourse across the country.

We also explored the possibility of building a Word2Vec model using TensorFlow for more advanced machine learning, but despite its great potential we could not pursue it in the limited time available. To distinguish between local and national discourse, we planned on utilizing Term Frequency-Inverse Document Frequency (TF-IDF), but could not put it together in time. For the visualization, we planned on showing an interactive US map with a heat map of newspaper locations, with the newspaper ranking as the size of the spot and the color indicating whether it is good news (green) or bad news (red). When a newspaper is selected, a list of associated headlines is revealed (color coded as good/bad), along with a pie chart showing the overall percentage of good/bad/neutral, related headlines from other news sites across the country, and a word cloud of the 200 most frequently used words. This visualization could also have a time slider that shows the change in sentiment for the newspapers over time. We had many more interesting visualization ideas to express our findings, but the limited time only allowed us to go this far. We have made all of our code and necessary data available in a GitHub repo and are trying to make a live installation available for exploration soon.
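
To give a sense of the sentiment step, here is a minimal Python sketch using the VADER library (the headlines are hypothetical, and the good/bad/neutral cutoffs follow VADER's usual convention for its compound score, not necessarily the exact thresholds our scripts used):

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

headlines = [
    "Local shelter finds homes for record number of pets",   # hypothetical examples
    "Storm damage closes downtown businesses for a week",
]

analyzer = SentimentIntensityAnalyzer()
for headline in headlines:
    scores = analyzer.polarity_scores(headline)   # dict with neg, neu, pos, compound
    if scores["compound"] >= 0.05:
        label = "good news"
    elif scores["compound"] <= -0.05:
        label = "bad news"
    else:
        label = "neutral"
    print(f"{label:>9}  {scores['compound']:+.2f}  {headline}")
```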


Next, the “Fake News” team presented their work. The team started with the research questions: “Is it fake news to misquote a presidential candidate by just one word? What about two? Three? When exactly does fake news become fake?” Based on these questions, they hypothesized that “fake news doesn’t only happen from the top down, but also happens at the very first moment of interpretation, especially when shared on social media networks”. With this in mind, they wanted to determine how Twitter users were recording, interpreting, and sharing the words spoken by Donald Trump and Hillary Clinton in real time. They also wanted to find out how the “facts” (the accurate transcription of the words) began to evolve into counter-facts or alternate versions of those words. They analyzed Twitter data from the second presidential debate and focused on the most prominent keywords, such as "locker room", "respect for women", and "jail". The analysis results were visualized using a word tree and a bar chart. They also conducted a sentiment analysis, which produced a surprising result: most tweets had positive sentiment toward the locker-room talk. Further analysis showed that sarcastic or insincere comments skewed the sentiment analysis, hence the positive sentiment.


After that, the “End of Term Transition” team presented their project. The group used public web archives to estimate how the main government domains changed around each US presidential administration transition. For each of these official websites, they planned to identify the kind and rate of change using multiple techniques, including Simhash, TF-IDF, edit distance, and efficient thumbnail generation. They investigated each of these techniques in terms of its performance and accuracy. The datasets were collected from the Internet Archive Wayback Machine around the 2001, 2005, 2009, 2013, and 2017 transitions. The team made their work available on GitHub.
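
As an illustration of the TF-IDF idea mentioned above (a sketch, not the team's actual code), comparing two captures of the same page can be done in a few lines of Python with scikit-learn; a lower similarity score suggests a larger change between transitions. The page text below is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical extracted text from two captures of the same .gov homepage
capture_2013 = "energy policy climate research grants public data portals"
capture_2017 = "energy dominance regulatory reform infrastructure jobs"

vectors = TfidfVectorizer().fit_transform([capture_2013, capture_2017])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {similarity:.3f}")  # lower value = more change
```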

Finally, a surprise new team joined: team “Nick”, presented by Nick Ruest (Digital Assets Librarian at York University). Nick has been exploring Twitter API mysteries, and he showed visualizations of some odd peaks that occurred.

After the teams presented their work, the judges announced the team with the most points, and the winning team was “End of Term Transition”.

This workshop was extremely interesting, and I enjoyed it fully. The fourth datathon, Archives Unleashed 4.0: Web Archive Datathon, was announced and will take place at the British Library, London, UK, June 11-13, 2017. Thanks to Matthew Weber, Ian Milligan, and Jimmy Lin for organizing this event, and to Jefferson Bailey, Vinay Goel, and everyone at the Internet Archive.

-Lulwah M. Alkwai

Monday, October 31, 2016

2016-10-31: Two Days at IEEE VIS 2016

I attended a brief portion of IEEE VIS 2016 in Baltimore on Monday and Tuesday (returned to ODU on Wednesday for our celebration of the Internet Archive's 20th anniversary).  As the name might suggest, VIS is the main academic conference for visualization researchers and practitioners. I've taught our graduate Information Visualization course for several years (project blog posts, project gallery), so I've enjoyed being able to attend VIS occasionally. (Mat Kelly presented a poster in 2013 and wrote up a great trip report.)

This year's conference was held at the Baltimore Hilton, just outside the gates of Oriole Park at Camden Yards.  If there had been a game, we could have watched during our breaks.





My excuse to attend this year (besides the close proximity) was that another class project was accepted as a poster. Juliette Pardue, Mridul Sen (@mridulish), and Christos Tsolakis took a previous semester's project, FluNet Vis, and generalized it. WorldVis (2-pager, PDF poster, live demo) allows users to load and visualize datasets of annual world data with a choropleth map and line charts. It also includes a scented widget in the form of a histogram showing the percentage of countries with reported data for each year in the dataset.

Before I get to the actual conference, I'd like to give kudos to whoever picked out the conference badges. I loved having a pen (or two) always handy.
Monday, October 24

Monday was "workshop and associated events" day.  If you're registered for the full conference, then you're able to attend any of these pre-main conference events (instead of having to pick and register for just one). This is nice, but results in lots of conflict in determining which interesting session to attend.  Thankfully, the community is full of live tweeters (#ieeevis), so I was able to follow along with some of the sessions I missed. It was a unique experience to be at a conference that appealed not only to my interest in visualization, but also to my interests in digital humanities and computer networking.

I was able to attend parts of 3 events:
  • VizSec (IEEE Symposium on Visualization for Cyber Security) - #VizSec
  • Vis4DH (Workshop on Visualization for the Digital Humanities) - #Vis4DH
  • BELIV (Beyond Time and Errors: Novel Evaluation Methods for Visualization) Workshop - #BELIV
VizSec

The VizSec keynote, "The State of (Viz) Security", was given by Jay Jacobs (@jayjacobs), Senior Data Scientist at BitSight, co-author of Data-Driven Security, and host of the Data-Driven Security podcast. He shared some of his perspectives as Lead Data Analyst on multiple Data Breach Investigations Reports. His data-driven approach focused on analyzing security breaches to help decision makers (those in the board room) better protect their organizations against future attacks. Rather than detecting a single breach, the goal is to determine how analysis can help them shore up their security in general. He spoke about how configuration (TLS, certificates, etc.) can be a major problem and that having a P2P share on the network indicates the potential for botnet activity.

In addition, he talked about vis intelligence and how CDFs and confidence intervals are often lost on the general public.

He also mentioned current techniques in displaying IT risk and how some organizations allow for manual override of the analysis.

And then during question time, a book recommendation: How to Measure Anything in Cybersecurity Risk
In addition to the keynote, I attended sessions on Case Studies and Visualizing Large Scale Threats.  Here are notes from a few of the presentations.

"CyberPetri at CDX 2016: Real-time Network Situation Awareness", by Dustin Arendt, Dan Best, Russ Burtner and Celeste Lyn Paul, presented analysis of data gathered from the 2016 Cyber Defense Exercise (CDX).

"Uncovering Periodic Network Signals of Cyber Attacks", by Ngoc Anh Huynh, Wee Keong Ng, Alex Ulmer and Jörn Kohlhammer, looked at analyzing network traces of malware and provided a good example of evaluation using a small simulated environment and real network traces.

"Bigfoot: A Geo-based Visualization Methodology for Detecting BGP Threats", by Meenakshi Syamkumar, Ramakrishnan Durairajan and Paul Barford, brought me back to my networking days with a primer on BGP.

"Understanding the Context of Network Traffic Alerts" (video), by Bram Cappers and Jarke J. van Wijk, used machine learning on PCAP traces and built upon their 2015 VizSec paper "SNAPS: Semantic Network traffic Analysis through Projection and Selection" (video).


Vis4DH

DJ Wrisley (@djwrisley) put together a great Storify with tweets from Vis4DH.
Here's a question we also ask in my main research area of digital preservation:

A theme throughout the sessions I attended was the tension between the humanities and the technical ("interpretation vs. observation", "rhetoric vs. objective"). Speakers advocated for technical researchers to attend digital humanities conferences, like DH 2016, to help bridge the gap and get to know others in the area.

There was also a discussion of close reading vs. distant reading.

Distant reading, analyzing the structure of a work, is relatively easy to visualize (frequency of words, parts of speech, character appearance), but close reading is about interpretation and is harder to fit into a visualization.   But the discussion did bring up the promise of distant reading as a way to navigate to close reading.
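
For example, the word-frequency side of distant reading takes only a few lines of Python; the file name below is just a placeholder for any plain-text edition of a work:

```python
import re
from collections import Counter

# Placeholder path to any plain-text edition of a work
with open("novel.txt", encoding="utf-8") as f:
    text = f.read().lower()

words = re.findall(r"[a-z']+", text)
for word, count in Counter(words).most_common(10):
    print(f"{count:6d}  {word}")
```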

BELIV
I made the point to attend the presentation of the 2016 BELIV Impact Award so that I could hear Ben Shneiderman (@benbendc) speak.  He and his long-time collaborator, Catherine Plaisant, were presented the award for their 2006 paper, "Strategies for Evaluating Information Visualization Tools: Multidimensional In-depth Long-term Case Studies".
Ben's advice was to "get out of the lab" and "work on real problems".

I also attended the "Reflections" paper session, which consisted of position papers from Jessica Hullman (@JessicaHullman), Michael Sedlmair, and Robert Kosara (@eagereyes).  Jessica Hullman's paper focused on evaluations of uncertainty visualizations, and Michael Sedlmair presented seven scenarios (with examples) for design study contributions:
  1. propose a novel technique
  2. reflect on methods
  3. illustrate design guidelines
  4. transfer to other problems
  5. improve understanding of a VIS sub-area
  6. address a problem that your readers care about
  7. strong and convincing evaluation
Robert Kosara challenged the audience to "reexamine what we think we know about visualization" and looked at how some well-known vis guidelines have either recently been questioned or should be questioned.

Tuesday, October 25

The VIS keynote was given by Ricardo Hausmann (@ricardo_hausman), Director at the Center for International Development & Professor of the Practice of Economic Development, Kennedy School of Government, Harvard University. He gave an excellent talk and shared his work on the Atlas of Economic Complexity and his ideas on how technology has played a large role in the wealth gap between rich and poor nations.





After the keynote, in the InfoVis session, Papers Co-Chairs Niklas Elmqvist (@NElmqvist), Bongshin Lee (@bongshin), and Kwan-Liu Ma described a bit of the reviewing process and revealed even more details in a blog post. I especially liked the feedback and statistics (including distribution of scores and review length) that were provided to reviewers (though I didn't get a picture of the slide). I hope to incorporate something like that in the next conference I have a hand in running.

I attended parts of both InfoVis and VAST paper sessions.  There was a ton of interesting work presented in both.  Here are notes from a few of the presentations.

"Visualization by Demonstration: An Interaction Paradigm for Visual Data Exploration" (website with demo video), by Bahador Saket (@bahador10), Hannah Kim, Eli T. Brown, and Alex Endert, presented a new interface for allowing relatively novice users to manipulate their data.  Items start out as a random scatterplot, but users can rearrange the points into bar charts, true scatterplots, add confidence intervals, etc. just by manipulating the graph into the idea of what it should look like.

"Vega-Lite: A Grammar of Interactive Graphics" (video), by Arvind Satyanarayan (@arvindsatya1), Dominik Moritz (@domoritz), Kanit Wongsuphasawat (@kanitw), and Jeffrey Heer (@jeffrey_heer), won the InfoVis Best Paper Award.  This work presents a high-level visualization grammar for building rapid prototypes of common visualization types, using JSON syntax. Vega-Lite can be compiled into Vega specifications, and Vega itself is an extension to the popular D3 library. Vega-Lite came out of the Voyager project, which was presented at InfoVis 2015. The authors mentioned that this work has already been extended - Altair is a Python API for Vega-Lite. One of the key features of Vega-Lite is the ability to create multiple linked views of the data.  The current release only supports a single view, but the authors hope to have multi-view support available by the end of the year.  I'm excited to have my students try out Vega-Lite next semester.

"HindSight: Encouraging Exploration through Direct Encoding of Personal Interaction History", by Mi Feng, Cheng Deng, Evan M. Peck, and Lane Harrison (@laneharrison), allows users to explore visualizations based on their own history in interacting with the visualization.  The tool is also described in a blog post and with demo examples.
"PowerSet: A Comprehensive Visualization of Set Intersections" (video), by Bilal Alsallakh (@bilalalsallakh) and Liu Ren, described a new method for visualizing set data (typically shown in Venn or Euler diagrams) in a rectangle format (similar to a treemap). 


On Tuesday night, I attended the VisLies meetup, which focused on highlighting poor visualizations that had somehow made it into popular media.  This website will be a great resource for my class next semester. I plan to ask each student to pick one of these and explain what went wrong.

Although I was only able to attend two days of the conference, I saw lots of great work that I plan to bring back into the classroom to share with my students.

(In addition to this brief overview, check out Robert Kosara's daily commentaries (Sunday/Monday, Tuesday, Wednesday/Thursday, Thursday/Friday) at https://eagereyes.org.)

-Michele (@weiglemc)

Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?


"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)
The first presidential debate (H. Clinton v. D. Trump) took place on last Monday, September 26, 2016 at Hofstra University, New York. The questions were about topics like economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, they associated a topic with a particular place or a US state (e.g., shootings in Chicago, Illinois, and crime rate in New York). This reminded me about the work that we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with the "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites are maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or other newspapers like www.washingtonpost.com and www.nytimes.com. They were crawled on the days around the election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., the Internet Archive's Heritrix writes what it finds on the Web into ARC files of 100 MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase following these instructions. Then, we wrote several Apache Spark Scala scripts to iterate over all the ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content as shown below (we hid the content of web pages due to copyright issues). Results were collected into a single TSV file.
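
Our extraction was done with Warcbase's Spark/Scala scripts; as a rough alternative sketch, the same kind of extraction could be approximated in Python with the warcio and BeautifulSoup libraries (the file name is hypothetical; arc2warc=True lets warcio expose ARC records through its WARC-style interface):

```python
# pip install warcio beautifulsoup4
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

with open("example.arc.gz", "rb") as stream:
    for record in ArchiveIterator(stream, arc2warc=True):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")
        html = record.content_stream().read()
        # Strip markup to get a clean textual version of the page
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        print("\t".join([date or "", uri or "", text[:200]]))
```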

[2] Extract named entities and topics

We used the Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling we used techniques such as LDA and TF-IDF.
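As a rough illustration of the topic-modeling step (not our actual scripts), an LDA model can be fit with the gensim library roughly as follows, using a few hypothetical tokenized documents:

```python
# pip install gensim
from gensim import corpora, models

# Hypothetical tokenized documents (one per archived page)
docs = [
    ["tax", "cuts", "jobs", "economy", "growth"],
    ["iraq", "war", "troops", "security", "terrorism"],
    ["college", "tuition", "students", "education", "loans"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)
for topic_id, topic in lda.print_topics(num_words=3):
    print(topic_id, topic)
```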
After applying the above techniques, the results were aggregated in a text file which was used as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.

State         Candidate   Mentions of the state   Most important topic
Mississippi   Kerry          85                   Iraq
Mississippi   Bush          131                   Energy
Oklahoma      Kerry          65                   Jobs
Oklahoma      Bush           85                   Retirement
Delaware      Kerry          53                   Colleges
Delaware      Bush            2                   Other
Minnesota     Kerry         155                   Jobs
Minnesota     Bush          303                   Colleges
Illinois      Kerry          86                   Iraq
Illinois      Bush          131                   Health
Georgia       Kerry         101                   Energy
Georgia       Bush          388                   Tax
Arkansas      Kerry          66                   Iraq
Arkansas      Bush           42                   Colleges
New Mexico    Kerry         157                   Jobs
New Mexico    Bush          384                   Tax
Indiana       Kerry         132                   Tax
Indiana       Bush           43                   Colleges
Maryland      Kerry          94                   Jobs
Maryland      Bush          213                   Energy
Louisiana     Kerry          60                   Iraq
Louisiana     Bush          262                   Tax
Texas         Kerry         195                   Terrorism
Texas         Bush         1108                   Tax
Tennessee     Kerry          69                   Tax
Tennessee     Bush          134                   Teacher
Arizona       Kerry          77                   Iraq
Arizona       Bush          369                   Jobs
...

[3]  Interactive US map 

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic), while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, like the winning party for each state. In addition, we inserted latitude and longitude coordinates to locate the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. If you are interested in interacting with the map, visit (http://www.cs.odu.edu/~maturban/hackathon/).


Looking at the map might help us answer the research questions, but it might raise other questions, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, it might be useful to revisit this topic now, as we are close to the 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could be applied again to see what newspapers say about this event.


--Mohamed Aturban

Monday, June 27, 2016

2016-06-27: Archives Unleashed 2.0 Web Archive Hackathon Trip Report


Members from WSDL who participated in the Hackathon 2.0
Last week, June 13-15, 2016, six members of the Web Science and Digital Library group (WSDL) from Old Dominion University had the opportunity to attend Archives Unleashed 2.0 at the Library of Congress in Washington DC. This event was a follow-up to Archives Unleashed (Web Archive Hackathon 1.0), held in March 2015 at the University of Toronto Library, Toronto, Ontario, Canada. We (Mat Kelly, Alexander Nwala, John Berlin, Sawood Alam, Shawn Jones, and Mohamed Aturban) met with other participants from various countries and with different backgrounds -- librarians, historians, computer scientists, etc. The main goal of this event was to build tools for web archives as well as to support an ongoing community with a common vision of how to access and extract data from web collections.

This event was made possible with generous support from the National Science Foundation, the Social Sciences and Humanities Research Council of Canada, the University of Waterloo’s Department of History, the David R. Cheriton School of Computer Science and the University of Waterloo, and the School of Communication and Information at Rutgers University.

The event was organized by Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University), Ian Milligan (assistant professor, Department of History, University of Waterloo), Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo), Nicholas Worby (Government Information & Statistics Librarian, University of Toronto), and Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec). Here are some details about different activities over the three days of Hackathon 2.0.


Day 1 (June 13, 2016)

Our evening gathering on the first day of Hackathon 2.0 was at Gelman Library, George Washington University. It was for (1) the participants to briefly introduce themselves and their areas of research, and (2) forming multiple groups to work on different Hackathon projects. To form groups, all participants were encouraged to write a few words on three separate sticky notes describing a general topic they were interested in (e.g., topic modeling, extracting metadata, studying tweets, and analysis of Supreme Court nominations), what kind of dataset they wanted to work on (e.g., collected tweets, or datasets from the 2004/2008 elections), and what they wished to accomplish with the selected dataset.



Participants then grouped together the sticky notes that had similar ideas. After that, the initial groups were formed, and every group was given a few minutes to introduce their project idea. At the end of the first day, we all went to a restaurant for dinner and to socialize. Here is a list of the groups formed after the brainstorming session; I will explain later in some detail what each group accomplished:

  • Twitter Political Organization - Allie Kosterich, Nich Worby, John Berlin, Laura Wrubel, Gregory Wiedeman
  • Mojitos - Nathalie Casemajor, Federico Nanni, Alexander Nwala, Sylvain Rocheleau, Jana Hajzlerova, Petra Galuscakova
  • Museum - Ed Summers, Emily Maemura, Sawood Alam, Jefferson Bailey
  • I Know What You Hid Last Summer - Mat Kelly, Shawn Walker, Keesha Henderson, Jaimie Murdock, Jessica Ogden, Ramine Tinati, Niko Tsakalakis
  • The Supremes - Nicholas Taylor, Ian Milligan, Jimmy Lin, Patrick Rourke, Todd Suomela, Andrew Weber
  • Team Turtle - Mohamed Aturban, Niel Chah, Steve Marti, Imaduddin Amin
  • Counter-Terrorism - Daniel Kerchner, Emily Gade
  • Campaign: Origins - Allison Hegel, Debra A. Riley-Huff, Justin Littman, Shawn M. Jones, Kevin Foley, Ericka Menchen-Trevino, Nick Bennett, Louise Keen


Day 2 (June 14, 2016)

Colleen Shogan, from the Library of Congress, declared the Hackathon open. Colleen mentioned that researchers who have questions about politics, history, or any other aspect of cultural memory really need specialists like us to help them access data available in repositories such as the Internet Archive and the Library of Congress. She emphasized the importance of such events, and finally she thanked the people who made this event possible, including the organizers and the steering committee.


Matthew Weber presented the agenda for the day, including presentations, a brief tour of the Library of Congress, and revising the groups formed the day before. Matthew gave an example from his past dissertation work illustrating how difficult it is to use web archives to answer research questions without building tools. He stated that this ongoing community is building a common vision for web archive tool development to help access and extract data and uncover important stories from web archives. Finally, Matthew listed several kinds of datasets available for the participants to work on, such as 2004, 2008, and 2010 election data and the Supreme Court nominations.


Ian Milligan introduced Warcbase (slides), which was developed by a team of five historians, three computer scientists, and a network scholar. Ian showed how slow it is to browse web archives the traditional way, by entering a URL in the Wayback Machine (remembering that requiring the URL itself limits what you can find in the archives). Warcbase goes beyond that: it can be used to access, extract, manage, and process data from WARC files (e.g., extracting names, locations, plain text, URIs, and more, and generating different formats like network graphs or metadata in JSON). Warcbase supports filtering data based on dates, domain names, languages, etc. In addition, Warcbase is scalable, which means it can run on a laptop, a powerful desktop, or a cluster. Users can run Warcbase via command line tools as well as an interactive web-based interface.




Jefferson Bailey and Vinay Goel from the Internet Archive presented the Archive Research Services Workshop. Jefferson mentioned that the Internet Archive focuses on collecting web resources and providing access to those collections. The Internet Archive does not allow researchers to access its infrastructure for intensive research like data mining. The Internet Archive has huge web collections, about 13 terabytes, and it collects about a billion URIs a week. Jefferson also indicated that WARC files are huge and difficult to work with. Researchers might request huge collections in WARC format but end up using only a small portion. For those reasons, the Internet Archive is trying to support specific research questions: instead of providing data in WARC format, it will give users access to datasets in different formats like CDX, which consists of metadata about the original WARC files. Other formats include the Web Archive Transformation dataset (WAT), the Longitudinal Graph Analysis dataset (LGA), and the Web Archive Named Entities dataset (WANE). Having such formats gives us much smaller datasets; for example, CDX is only about one percent of the size of the WARC files.
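
To give a sense of why CDX is so compact, each capture is described by a single line of space-delimited metadata. Field order and names vary between CDX flavors; the Python sketch below assumes one common 11-field layout and uses a made-up example line:

```python
# A hypothetical CDX line describing one capture (field order varies by CDX flavor)
line = ("org,example)/page 20041102070000 http://example.org/page "
        "text/html 200 SHA1DIGESTPLACEHOLDER - - 2153 843 crawl-file.arc.gz")

fields = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "robotflags", "length", "offset", "filename"]
record = dict(zip(fields, line.split()))

print(record["original"], record["timestamp"], record["statuscode"])
```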




Next, Vinay Goel from the Internet Archive continued on the same topic (Archive Research Services Workshop). Vinay gave a quick overview of ArchiveSpark, a tool that helps the community search and filter Internet Archive collections (e.g., filtering can be based on date, MIME type, and HTTP response code). A research paper about ArchiveSpark was accepted and will be presented at JCDL 2016.





Abigail Grotke and Andrew Weber introduced Library of Congress Data Resources. Abigail and Andrew work on the web archiving team along with other members at the Library of Congress. They indicated that most of the crawling is done by the Internet Archive. The Library of Congress has been collecting web resources for more than 16 years and has made some collections available on the web. These collections are searchable (not full-text search), indexed, and accessible through the Wayback Machine. In addition, the Library of Congress archive supports Memento. Most of the collections cannot be accessed on the web due to copyright issues and permissions policies, and researchers must be physically present at the Library of Congress to access them.






During the coffee break, we had the opportunity to take a short tour of the Library of Congress Jefferson Building, which was the first building in DC with electricity and an elevator in use.



After the coffee break, each group explained briefly their project idea, and what kind of dataset they were going to use. At this time, some participants moved to other groups as they found more interesting ideas.



During lunch, we listened to five-minute lightning talks. Nicholas Taylor (Stanford University) introduced WASAPI, Jefferson Bailey (Internet Archive) gave a short talk about Researcher Services, Ericka Menchen-Trevino (American University) presented Web Historian, Nathalie Casemajor, Petra Galuscakova, and Sylvain Rocheleau briefly explained NUAGE, Alexander Nwala (Old Dominion University) introduced the topic Generating Collections for Stories and Event, and finally John Berlin (Old Dominion University) presented Are Wails Electric?



After that, the groups were placed in different rooms based on what kind of equipment each team might need for its project. Each group met and had the opportunity to work on their project ideas for about 5 hours (with a 30-minute coffee break after the first 2 hours). By the end of the second day, we all came together around 6 PM, and each group's representative gave an update on their team's progress. Then, all participants were invited to dinner.





Day 3 (June 15, 2016)

Most of the last day was for the groups to work intensively on their projects and produce their final results. From breakfast at 8:30 AM at the Madison Atrium until the end of the day at 6:30 PM, we worked on our projects except during the coffee break and lunch. Some participants gave five-minute lightning talks during lunch. The audio was not very clear in the Madison Atrium; Justin Littman stood on a chair to deliver his talk, yet his voice still did not carry clearly. For this reason, I will only briefly mention what those talks were about.


Laura Wrubel, Daniel Kerchner, and Justin Littman from George Washington University presented an introduction to the new Social Feed Manager, a sampling of research projects supported by Social Feed Manager, and the provenance of a tweet (as inspired by web archiving). Sawood Alam from Old Dominion University introduced MemGator – A Memento Aggregator CLI and Server in Go. Jaimie Murdock from Indiana University presented Polygraphic and Polymathic: Into Thomas Jefferson’s Mind. Finally, Mat Kelly from Old Dominion University gave a short talk about Exploring Aggregation of Personal, Private, and Institutional Web Archives.




Final presentations

By the end of the day, each group presented the findings of the project that they were working on for the last couple of days:




  • Mojitos (Slides)

  • The team's goal was to detect and track the events discussed by politically opposed media covering Cuba. This was done by processing news data from the state-controlled Cuban media (Granma) and a Florida-based outlet that caters to Cuba (el Nuevo Herald).




  • Campaign: Origins (Slides)

  • Using tweets with #election2016, @realDonaldTrump, and @HillaryClinton, this team searched for narratives using the content of the web pages linked to from these tweets, rather than just the tweets themselves. The tweets were collected on June 14-15. The team's intention is to use the Internet Archive's Save Page Now feature to capture the web pages as they are tweeted so that such a study can be repeated on a larger set of tweets in the future. They produced the following streamgraph.





  • The Supremes (Slides and more details)

  • This group analyzed web archive data, provided in ARC format by the Library of Congress, about the Supreme Court nominations of Justice Alito and Justice Roberts. The datasets total 92 GB and contain 2.2 million pages about Alito and Roberts. The goal of the team was to explore and analyze the data and produce more possible research questions. They used Warcbase to extract datasets from the ARC files. In addition, Warcbase can produce files in a format that can be opened directly in other platforms like Gephi.





  • I Know What You Hid Last Summer (Slides)

  • The team took Twitter datasets from the UK and Canadian Parliament members, identified the deleted tweets, noted which tweets contained links, checked if those links died after the tweet was deleted, and tried to derive meaning from the deletion. Further visualization was also done.




  • Museum

  • This team analyzed CDX files from the Internet Archive's IMLS Museums crawl, consisting of over 219 million captures. They also utilized the Museum Universe Data File from IMLS to enrich their findings. They evaluated the proportion of various content types (such as images or PDFs) that were crawled, and they quantified the term frequencies in the URLs of each content type. Additionally, they demonstrated the domain name distribution in the collection in a hierarchical chart (using a treemap). Part of their analysis is published on GitHub.





  • Counter-Terrorism (Slides)

  • This team collected 383,527 tweets (between 2013 and 2016) from 1,153 accounts of suspected extremists. Approximately 300 people are associated with these accounts. The tweets are in a mix of English and Arabic. The goal is to identify ISIS supporters by running the tweets through an ideology classifier.




  • Team Turtle (Slides)

  • The team used a dataset about the 2004 Presidential Election provided by the Library of Congress. The dataset was collected during the day before the election, election day, and the day after the election. The goal of this team was to answer questions like (1) if one candidate spends more time talking about issues related to a particular state than the other candidate does, does this lead him to win the state? (2) do candidates give more time to the "swing" states than others? and (3) what is the most important topic for each state? The dataset was available in ARC format, and the Warcbase tool was used to extract text from those files. After that, the dataset was analyzed using techniques like the Stanford NER tagger to tag places, people, and organizations, and the LDA model and TF-IDF to identify topics. Finally, the team produced an interactive visualization using D3.js.







  • Twitter Political Organization (Slides)

  • The team created a timeline of mentions of donations to the Service Employees International Union (SEIU) in candidate tweets, a graph of retweets per day for each candidate, and a sentiment analysis (Naive Bayes classifier) of the candidates' tweets, in an attempt to see whether there was a correlation between donation amounts over time and how positively or negatively the candidates tweeted.


    After all the groups presented their work, Jimmy Lin announced ArchivesUnleashed Inc. It is a Delaware non-profit corporation aiming to create knowledge around the scholarly use of web archives. The board of directors of this new organization includes:
    • Ian Milligan (assistant professor, Department of History, University of Waterloo)
    • Matthew Weber (Assistant Professor, School of Communication and Information, Rutgers University)
    • Jimmy Lin (the David R. Cheriton Chair, David R. Cheriton School of Computer Science, University of Waterloo)
    • Nathalie Casemajor (assistant professor, Department of Social Sciences, University of Québec)
    • Nicholas Worby (Government Information & Statistics Librarian, University of Toronto)

    The winning team

    Ian Milligan announced the winning team, Counter-Terrorism (congratulations to Daniel Kerchner and Emily Gade). In addition, the top four teams (Counter-Terrorism, Team Turtle, I Know What You Hid Last Summer, and Mojitos) were selected to present their work at the next day's event, Saving the Web.




    --Mohamed Aturban