"Team Turtle" at Archives Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban, and I. Amin)
The
first presidential debate (H. Clinton v. D. Trump) took place last Monday, September 26, 2016, at
Hofstra University, New York. The questions covered topics such as the economy, taxes, jobs, and race. During the debate, the candidates raised those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York). This reminded me of the work we did at the second
Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with "Team Turtle" (
Niel Chah,
Steve Marti,
Mohamed Aturban, and
Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the
2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in
ARC format. These key web sites are maintained by the candidates or their political parties (e.g.,
www.georgewbush.com,
www.johnkerry.com,
www.gop.com, and
www.democrats.org) or by newspapers such as
www.washingtonpost.com and
www.nytimes.com. The sites were crawled in the days around election day (November 2, 2004). The goal of this project was to investigate two questions: "
How many times did each candidate mention each state?" and "What topics were they talking about?"
We had only two days to finish the project and present our findings. Fortunately, we managed to complete it in three main steps:
(1) extract plain text from ARC files,
(2) apply some techniques to extract named entities and topics, and
(3) build a visualization tool to better show the results. Our processing scripts are available on
GitHub.
[1] Extract textual data from ARC files
The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g.,
Internet Archive's
Heritrix writes what it finds on the Web in ARC files of 100MB each). ARC is the predecessor format to the now more popular
WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use
Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these
instructions. Then, we wrote several Apache Spark
Scala scripts to iterate over all ARC files and generate a clean textual version of each page (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content (we do not show the content of the web pages here because of copyright issues). Results were collected into a single TSV file.
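To make the shape of that output concrete, here is a minimal Python sketch of the tag-stripping and TSV-writing step. It is not the Warcbase/Scala code we ran at the hackathon: it uses only Python's standard library, and the record layout, the sample page, and the helper names (`TextExtractor`, `html_to_text`, `to_tsv`) are assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urlparse


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all whitespace so each page fits on a single TSV line.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_tsv(records):
    """records: iterable of (page_id, crawl_date, uri, html) tuples (assumed layout)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for page_id, crawl_date, uri, html in records:
        domain = urlparse(uri).netloc
        writer.writerow([page_id, crawl_date, domain, uri, html_to_text(html)])
    return out.getvalue()


# Hypothetical record standing in for one archived page.
sample = [(
    "1",
    "20041102",
    "http://www.example.com/news.html",
    "<html><head><style>p{}</style></head>"
    "<body><p>Hello <b>Ohio</b>!</p></body></html>",
)]
tsv = to_tsv(sample)
print(tsv, end="")
```

Joining text chunks with a space keeps words from adjacent block elements apart, at the cost of an occasional stray space before punctuation; for counting named entities that trade-off seemed acceptable.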
[2] Extract named entities and topics
We used
Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:
After applying these techniques, we aggregated the results into a text file that serves as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.
State | Candidate | Number of mentions of the state | Most important topic |
Mississippi | Kerry | 85 | Iraq |
Mississippi | Bush | 131 | Energy |
Oklahoma | Kerry | 65 | Jobs |
Oklahoma | Bush | 85 | Retirement |
Delaware | Kerry | 53 | Colleges |
Delaware | Bush | 2 | Other |
Minnesota | Kerry | 155 | Jobs |
Minnesota | Bush | 303 | Colleges |
Illinois | Kerry | 86 | Iraq |
Illinois | Bush | 131 | Health |
Georgia | Kerry | 101 | Energy |
Georgia | Bush | 388 | Tax |
Arkansas | Kerry | 66 | Iraq |
Arkansas | Bush | 42 | Colleges |
New Mexico | Kerry | 157 | Jobs |
New Mexico | Bush | 384 | Tax |
Indiana | Kerry | 132 | Tax |
Indiana | Bush | 43 | Colleges |
Maryland | Kerry | 94 | Jobs |
Maryland | Bush | 213 | Energy |
Louisiana | Kerry | 60 | Iraq |
Louisiana | Bush | 262 | Tax |
Texas | Kerry | 195 | Terrorism |
Texas | Bush | 1108 | Tax |
Tennessee | Kerry | 69 | Tax |
Tennessee | Bush | 134 | Teacher |
Arizona | Kerry | 77 | Iraq |
Arizona | Bush | 369 | Jobs |
...
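The aggregation behind this table can be sketched in a few lines of Python. This is an illustration only: it replaces Stanford NER with plain string matching on state names, and the domain-to-candidate mapping, the shortened state list, and the sample rows are assumptions, not our hackathon data.

```python
import re
from collections import Counter

# Assumed mapping from site domain to candidate (illustrative only).
CANDIDATE_BY_DOMAIN = {
    "www.georgewbush.com": "Bush",
    "www.johnkerry.com": "Kerry",
}

# Shortened list for the example; a real run would cover all states.
STATES = ["New Mexico", "Texas", "Ohio"]

# One pattern that matches any listed state name as a whole phrase.
STATE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(s) for s in STATES) + r")\b"
)


def count_state_mentions(rows):
    """rows: iterable of (domain, text); returns a Counter keyed by (state, candidate)."""
    counts = Counter()
    for domain, text in rows:
        candidate = CANDIDATE_BY_DOMAIN.get(domain)
        if candidate is None:
            continue  # skip pages that are not tied to a candidate
        for state in STATE_PATTERN.findall(text):
            counts[(state, candidate)] += 1
    return counts


# Made-up page texts, only to exercise the function.
rows = [
    ("www.georgewbush.com", "Tax relief for Texas families. Texas jobs are growing."),
    ("www.johnkerry.com", "A plan for New Mexico and Ohio."),
    ("www.nytimes.com", "Texas polls open early."),
]
counts = count_state_mentions(rows)
for (state, candidate), n in sorted(counts.items()):
    print(f"{state} | {candidate} | {n}")
```

Plain string matching cannot tell whether an ambiguous name such as "Washington" refers to a state, a city, or a person, which is one reason to prefer a named entity recognizer for the real analysis.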
[3] Interactive US map
We decided to build an interactive US map using
D3.js. As shown below, the state color indicates the winning party (i.e.,
red for Republican and
blue for Democratic) while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, such as the winning party for each state. In addition, we supplied latitude and longitude coordinates to position the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. If you would like to interact with the map, visit
(http://www.cs.odu.edu/~maturban/hackathon/).
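One detail worth sketching is how a mention count can become a bubble size. The Python snippet below shows one reasonable approach, not the code behind our map: it scales the radius with the square root of the count so that bubble area, rather than radius, is proportional to the number of mentions, and it emits JSON records that a D3.js script could load. The field names, the `max_radius` value, and the coordinates are placeholders; the counts and topics for Texas come from the table in step [2].

```python
import json
import math


def bubble_radius(count, max_count, max_radius=40.0):
    """Return a radius such that the circle's area is proportional to count."""
    if max_count <= 0 or count <= 0:
        return 0.0
    return max_radius * math.sqrt(count / max_count)


def build_bubbles(rows, max_radius=40.0):
    """rows: dicts with state, candidate, mentions, topic, lat, lon (assumed fields)."""
    max_count = max((r["mentions"] for r in rows), default=0)
    return [
        {**r, "radius": round(bubble_radius(r["mentions"], max_count, max_radius), 2)}
        for r in rows
    ]


# Coordinates are rough placeholders, offset so the two bubbles do not overlap.
rows = [
    {"state": "Texas", "candidate": "Bush", "mentions": 1108, "topic": "Tax",
     "lat": 31.0, "lon": -100.0},
    {"state": "Texas", "candidate": "Kerry", "mentions": 195, "topic": "Terrorism",
     "lat": 31.0, "lon": -98.0},
]
bubbles = build_bubbles(rows)
print(json.dumps(bubbles, indent=2))
```

Scaling the radius linearly would make the larger count look far bigger than it is, because the eye compares areas; the square root keeps the visual ratio close to the ratio of the counts.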
Looking at the map might help us answer the research questions, but it also raises new ones, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because these are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, this topic seems timely, as we are close to the 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could be applied again to see what newspapers say about it.
--Mohamed Aturban