
Saturday, November 5, 2016

2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

Internet Archive & Libyan newspapers logos
Colonel Gaddafi ruled Libya for 42 years after taking power from King Idris in a 1969 military coup. In August 2011, his regime was toppled in the so-called Arab Spring. For more than four decades, the media in Libya was highly politicized to support Gaddafi’s regime and secure his power. After the Libyan revolution in 2011, the media was freed from the tight control of the government, and we have seen the establishment of tens if not hundreds of new media organizations. Here is an overview of one side of Gaddafi’s propaganda machine, the newspapers:
  • 71 newspapers and magazines 
  • All monitored and published by the Libyan General Press Corporation (LGPC) 
  • The Jamahiriya News Agency (JANA) was the main source of domestic news 
  • No real political function other than to polish the regime’s image 
  • Published information provided by the regime 
The following are the most well-known Libyan newspapers, all published by the LGPC:



All Libyan newspaper websites are no longer controlled by the government

After the revolution, most of the Libyan newspapers' websites, including the website of the Libyan General Press Corporation (LGPC), came under the control of foreign institutions, in particular an Egyptian company. Al Jamahiriya (www.aljamahiria.com/), El Shams (alshames.com), and El Fajr El Jadid (www.alfajraljadeed.com/) became Egyptian news websites under different names: Jime News (www.news.aljamahiria.com/), Kifah Arabi (www.news.kifaharabi.com/), and El Fajr El Jadid Alakbaria, while El Zahf Al Akhdar (www.azzahfalakhder.com/) is now a German sports blog. Here are the logos of the new websites (the new websites keep the same domain names, except alshames.com, which redirects to www.news.kifaharabi.com/):


Can we still have access to the old state media?
After this big change in Libya with the fall of the regime, can we still access the old state media? (This question might apply to other countries as well: would a political or regime change in any country lead to the loss of a part of its digital history?)
Fortunately, the Internet Archive has captured thousands of snapshots of the Libyan newspapers' websites. The main pages of Al Jamahiriya (www.aljamahiria.com/), El Shams (alshames.com), El Zahf Al Akhdar (www.azzahfalakhder.com/), and El Fajr El Jadid (www.alfajraljadeed.com/) have been captured 2,310, 606, 1,398, and 836 times, respectively, by the Internet Archive.

www.aljamahiria.com/ captured 2,310 times by the Internet Archive
www.azzahfalakhder.com/ captured 1,398 times by the Internet Archive
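These capture counts can be checked programmatically. Below is a minimal sketch (in Python, assuming the requests library) that queries the Internet Archive's public CDX API and counts the snapshots returned for each newspaper homepage; the field selection is an assumption to keep the response small.

import requests

CDX_API = "http://web.archive.org/cdx/search/cdx"

def capture_count(url):
    # Ask the CDX API for one field per capture; the first JSON row is a header.
    params = {"url": url, "output": "json", "fl": "timestamp"}
    resp = requests.get(CDX_API, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json() if resp.text.strip() else []
    return max(len(rows) - 1, 0)

for site in ["aljamahiria.com", "alshames.com", "azzahfalakhder.com", "alfajraljadeed.com"]:
    print(site, capture_count(site))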

Praise for Gaddafi no longer on the live web
We cannot conclude that the Internet Archive captured everything, especially since the content of these newspapers was extremely redundant in its praise of the regime. Still, the Internet Archive has captured important material: the regime's activities during the 2011 revolution, a lot of domestic news and the regime's interpretation of international news, many economic articles, the long process the Libyan authorities went through to help establish the African Union, Gaddafi's speeches, and more. Below is an example of one of these articles, published during the 2011 revolution, declaring that "there will be no future for Libya without our leader Gaddafi". This article is no longer available on the live web.
From the Internet Archive https://web.archive.org/web/20

Slides about this post are also available:
--Mohamed Aturban

Wednesday, October 26, 2016

2016-10-26: They should not be forgotten!

Source: http://www.masrawy.com/News/News_Various/details/2015/6/7/596077/أسرة-الشهيد-أحمد-بسيوني-فوجئنا-بصورته-على-قناة-الشرق-والقناة-نرفض-التصريح
I remembered his face and smile very well. It was very tough for me to look at his smile and realize that he will not be in this world again. It got worse when I read his story and the stories of many others who died defending the future of my home country, Egypt, hoping to draw a better future for their kids. Ahmed Basiony, one of Egypt’s great artists, was killed by the Egyptian regime on January 28, 2011. One of the main reasons Basiony participated in the protests was to film police beatings and document what was happening. While filming, he also used his camera to zoom in on the soldiers and warn the people around him so they could take cover before the gunfire started. Suddenly, his camera fell.

Basiony was a father of two kids, one and six years old. He was loved by everyone who knew him. I hope Basiony's story and the stories of others will remain for future generations.


Basiony was among the protesters in the first days of the Egyptian Revolution.
Source: https://www.facebook.com/photo.php?fbid=206347302708907&set=a.139725092704462.24594.100000009164407&type=3&theater
curl -I http://1000memories.com/egypt 
HTTP/1.1 404 Not Found 
Date: Tue, 25 Oct 2016 16:53:04 GMT 
Server: nginx/1.4.6 (Ubuntu) 
Content-Type: text/html; charset=UTF-8


Basiony's information, along with that of many other martyrs, was documented at 1000memories.com/egypt. The 1000memories site contained a digital collection of around 403 martyrs with information about their lives. The entire Web site is unavailable now, and the Internet Archive is the only place where it was archived. 1000memories is not the only site that has disappeared; many other repositories containing videos, images, and other materials documenting the 18 days of the Egyptian Revolution have also disappeared. Examples are iamtahrir.com (archived version), which contained the artwork produced during the Egyptian Revolution, and 25Leaks.com (archived versions), which contained hundreds of important documents posted by people during the revolution. Both sites were created for collecting content related to the Egyptian Revolution.

The Jan. 25 Egyptian Revolution is one of the most important events in recent history. Several books and initiatives have been published documenting the 18 days of the Egyptian Revolution. These books cited many digital collections and other sites that were dedicated to documenting the Egyptian Revolution (e.g., 25Leaks.com). Unfortunately, the links to many of these Web sites are now broken, and there is no way (without the archive) to know what they contained.

Luckily, 1000memories.com/egypt has multiple copies in the "Egypt Revolution and Politics" collection in Archive-It, a subscription service from the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. I'm glad I found information about Basiony and many more martyrs archived!
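As a quick way to check whether a vanished page such as 1000memories.com/egypt still has archived copies, here is a minimal sketch using the Internet Archive's public availability API (chosen for brevity; a Memento TimeGate or the Archive-It collection itself would be other options). The timestamp argument is an assumption about the date of interest.

import requests

def closest_memento(url, timestamp=None):
    # Returns the URI of the closest Wayback Machine snapshot, or None if unarchived.
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # target datetime as YYYYMMDDhhmmss
    resp = requests.get("https://archive.org/wayback/available", params=params, timeout=30)
    resp.raise_for_status()
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap else None

print(closest_memento("1000memories.com/egypt", "20110601000000"))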


Archiving Web pages is a method for ensuring these resources are available for posterity. My PhD research focused on exploring methods for summarizing and interacting with collections in Archive-It, and recording the events of the Egyptian Revolution spurred my initial interest in web archiving. My research necessarily focused on quantitative analysis, but this post has allowed me to revisit the humanity behind these web pages that would be lost without web archiving.



--Yasmin

Tuesday, October 25, 2016

2016-10-25: Web Archive Study Informs Website Design

Shortly after beginning my Ph.D. research with the Old Dominion University Web Science and Digital Libraries team, I also rediscovered a Hampton Roads folk music non-profit I had spent a lot of time with years before.  Somehow I was talked into joining the board (not necessarily the most sensible thing when pursuing a Ph.D.).

My research area being digital preservation and web archiving, I decided to have a look at the Tidewater Friends of Folk Music (TFFM) website and its archived web pages (mementos).  Naturally, I looked at the oldest copy of the home page available, from 2002-01-25.  What I found is definitely reminiscent of early, mostly hand-coded HTML:

tffm.org 2002-01-25 23:57:26 GMT (Internet Archive)
https://web.archive.org/web/20020125235726/http://tffm.org/


Of course the most important thing for most people is concerts, so I had a look at the concerts page too. (Interestingly, the oldest concerts page available is five years newer than the oldest home page; this phenomenon was the subject of my JCDL 2013 paper.)


tffm.org/concerts 2007-10-07 06:17:32 GMT (Internet Archive)
https://web.archive.org/web/20071007061732/http://tffm.org/concerts.html

Clicking my way through the home and concert page mementos, I found little had changed over time other than the masthead image.

Masthead images from mementos captured 2005-08-26 21:05:28 GMT, 2005-12-11 09:23:55 GMT, and 2009-08-31 06:31:40 GMT

The end result is that I became, and remain, TFFM’s webmaster.  However, studying web archive quality, that is, completeness and temporal coherence, has greatly influenced my redesigns of the TFFM website.  First up was bringing the most important information to the forefront in a much more readable and navigable format.  Here is a memento captured 2011-05-23:

tffm.org 2011-05-23 11:10:54 GMT (Internet Archive)
https://web.archive.org/web/20110523111054/http://www.tffm.org/concerts.html

As part of the redesign, I put my new-found knowledge of archival crawlers to use.  The TFFM website now had a proper sitemap, and every concert had its own URI with very few URI aliases.  This design lasted until the TFFM board decided to replace “Folk” with “Acoustic,” changing the name to Tidewater Friends of Acoustic Music (TFAM).
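For illustration, here is a minimal sketch of the kind of sitemap that helps an archival crawler find every concert URI. The domain and slug pattern are hypothetical; only the idea of one stable URI per concert comes from the redesign described above.

from xml.etree import ElementTree as ET

# Hypothetical list of concert slugs; in practice these would come from the site's data.
concert_slugs = ["2011-06-04-open-mic", "2011-06-18-featured-artist"]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for slug in concert_slugs:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = f"http://tffm.org/concerts/{slug}.html"

# Write sitemap.xml so crawlers (including archival crawlers) can discover every concert page.
ET.ElementTree(urlset).write("sitemap.xml", xml_declaration=True, encoding="utf-8")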

Along with the change came a brighter look and mobile-friendly design.  Again putting knowledge from my Ph.D. studies to work, the mobile-friendly design is responsive, adapting to the user’s device, rather than incorporating a second set of URIs and an independent design.  With the responsive approach, archived copies replay correctly in both mobile and desktop browsers.

tidewateracoustic.org 2014-10-07 01:56:07 GMT
https://web.archive.org/web/20141007015607/http://tidewateracoustic.org/

After watching several fellow Ph.D. students struggle with the impact of JavaScript and dynamic HTML on archivability, I elected to minimize the use of JavaScript on the TFAM site.  JavaScript greatly complicates web archiving and significantly reduces archive quality.

So, the sensibility of taking on a volunteer website project while pursuing my Ph.D. aside, I can say that in some ways the two have synergy.  My Ph.D. studies have influenced the design of the TFAM website and the TFAM website is a small, practical, and personal proving ground for my Ph.D. work.  The two have complemented each other well.

Enjoy live music? Check out http://tidewateracoustic.org!

— Scott G. Ainsworth



2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

Web archives perform the crucial service of preserving our collective digital heritage. October 26, 2016 marks the 20th anniversary of the Internet Archive, and the United States presidential election will take place on November 8, 2016.  To commemorate both occasions, let us look at the 2008 and 2012 US general elections as told by web archives from the perspectives of CNN and Fox News. We started with three news outlets - MSNBC, CNN, and Fox News - in order to capture both ends of the political spectrum. However, msnbc.com has redirected to various URLs in the past (e.g., msnbc.msn.com, nbcnews.com), and as a result the site is not well archived.

Obama vs McCain - Fox News (2008)
Obama vs McCain - CNN (2008)

Obama vs Romney - Fox News (2012)
The archives show that the current concerns about voter fraud and election irregularities are not new (at least on Fox News, we did not find corresponding stories at CNN).
This Fox News page contains a story titled: "Government on High Alert for Voter Fraud" (2008)

Fox News: "Trouble at the ballot box" (2008)

Fox News claimed that a mural of Obama at a Philadelphia polling station, which a judge had ordered covered, was not properly covered (2012)
Obama vs Romney - CNN (2012)
We appreciate the ability to tell these stories by virtue of the presence of public web archives such as the Internet Archive. We also appreciate frameworks such as the Memento protocol that provide a means to access multiple web archives, and tools such as Sawood's MemGator, which implements the Memento protocol. For the comprehensive list of mementos (extracted with MemGator) for these stories, see: Table vis or Timeline vis. A small sketch of how such lists can be retrieved follows.
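For readers who want to gather such lists of mementos themselves, here is a minimal sketch that fetches a Memento TimeMap (RFC 7089 link format) and prints each capture. It queries the Internet Archive's TimeMap endpoint directly for simplicity; an aggregator such as MemGator serves TimeMaps in the same format across many archives. The example URI (the Fox News homepage) is just an illustration.

import requests

def list_mementos(urir):
    # Fetch the link-format TimeMap for a URI and yield (datetime, memento URI) pairs.
    resp = requests.get("http://web.archive.org/web/timemap/link/" + urir, timeout=60)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        if 'rel="memento"' in line:
            urim = line.split(";")[0].strip().strip("<>")
            dt = line.split('datetime="')[1].split('"')[0] if 'datetime="' in line else ""
            yield dt, urim

for dt, urim in list_mementos("http://www.foxnews.com/"):
    print(dt, urim)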
--Nwala

2016-10-25: Paper in the Archive

Mat reports on his journalistic experience and how we can relive it through Internet Archive (#IA20)                            

We have our collections, the things we care about, the mementos that remind us of our past. Many of these things reside on the Web. For those we want to recall and should have (in hindsight) saved, we turn to the Internet Archive.

As a computer science (CS) undergrad at University of Florida, I worked at the student-run university newspaper, The Independent Florida Alligator. This experience became particularly relevant with my recent scholarship to preserve online news. At the paper, we reported mostly on the university community, but also on news that catered to the ACRs through reports about Gainesville (e.g., city politics).

News is compiled late in the day to maximize temporal currency. I started at the paper as a "Section Producer" and eventually became Managing Editor. I was in charge of the online edition, the "New Media" counterpart of the daily print edition -- Alligator Online. The late shift fit well with my already established coding schedule.

Proof from '05, with the 'thew' still intact.

The Alligator is an independent newspaper -- the content we published could conflict with the university without fear of being censored by the university. Typical university-affiliated college newspapers have this conflict of interest, which potentially limits their content to only that which is approved. This was part of the draw of the paper for me and, I imagine, for student readers seeking less biased reporting. The orange boxes were often empty well before day's end. Students and ACRs read the print paper. As a CS student, I preferred Alligator Online.

With a unique technical perspective among my journalistic peers, I introduced a homebrewed content management system (CMS) into the online production process. This allowed Alligator Online to focus on porting the print content and not on futzing with markup. This also made the content far more accessible and, as time has shown thanks to Internet Archive, preservable.

Internet Archive's capture of Alligator Online at alligator.org over time with my time there highlighted in orange.

After graduating from UF in 2006, I continued to live and work elsewhere in Gainesville for a few years. Even then technically an ACR, I still preferred Alligator Online to print. A new set of students transitioned into production of Alligator Online and eventually deployed a new CMS.

Now as a PhD student of CS studying the past Web, I have observed a resultant decline in accessibility that occurred after I had moved on from the paper. This corresponds further with our work On the Change in Archivability of Websites Over Time (PDF). Thankfully, adaptations at Alligator Online and possibly IA have allowed the preservation rate to recover (see above, post-tenure).

alligator.org before (2004) and after (2006) I managed, per captures by Internet Archive.

With Internet Archive celebrating 20 years in existence (#IA20), IA has provided the means for me to see the aforementioned trend in time. My knowledge in the mid-2000s of web standards and accessibility facilitated preservation. Because of this, with special thanks to IA, the collections of pages I care about -- the mementos that remind me of my past -- are accessible and well-preserved.

— Mat (@machawk1)

NOTE: Only after publishing this post did I think to check alligator.org's robots.txt file as archived by IA. The final capture of alligator.org in 2007, before the next temporally adjacent one in 2009, occurred on August 7, 2007. At that time (and prior), no robots.txt file existed for alligator.org, despite IA preserving the 404. Around late October of that same year, a robots.txt file was introduced with the lines:
User-Agent: *
Disallow: /
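A "Disallow: /" rule like the one above tells compliant crawlers to skip the entire site, which matches the gap in captures noted above. A small sketch with Python's standard-library robots.txt parser shows the effect (the crawler name below is only illustrative):

from urllib.robotparser import RobotFileParser

# Recreate the archived robots.txt rules instead of fetching the live file.
rp = RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

# Every path is off-limits to any compliant crawler (the user-agent name is illustrative).
print(rp.can_fetch("archive.org_bot", "http://www.alligator.org/"))       # False
print(rp.can_fetch("archive.org_bot", "http://www.alligator.org/news/"))  # False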

Saturday, October 22, 2016

2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)


Dodging the Memory Hole 2016, held at UCLA's Charles Young Research Library in Los Angeles, California, was a two-day event to discuss and highlight potential solutions to the issue of preserving born-digital news. Organized by Edward McCain (digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries), this event brought together technologists, archivists, librarians, journalists, and fourteen graduate students who had won travel scholarships for attendance.  Among the attendees were four members of the WS-DL group (l-r): Mat Kelly, John Berlin, Dr. Michael Nelson, and Shawn Jones.


Day 1 (October 13, 2016)

Day one started off at 9am with Edward McCain welcoming everyone to the event and then turning it over to Ginny Steel, UCLA University Librarian, for opening remarks.
In the opening remarks, Ginny reflected on her career as a lifelong librarian and the evolution of printed news to digital, and, in closing, she summarized the role archiving has to play in the born-digital news era.
After opening remarks, Edward McCain went over the goals and sponsors of the event before transitioning to the first speaker Hjalmar Gislason.


In the talk, Hjalmar touched on issues concerning the amount of data currently being generated, how to determine the context of data, and how data lost because no one realized its importance could mean losing someone's life work. Hjalmar ended his talk with two takeaway points: "There is more to news archiving than the web: there is mobile content" and "Television news is also content that is important to save".

After a short break, panel one, consisting of Chris Freeland, Matt Weber, Laura Wrubel, and moderator Ana Krahmer, addressed the question of "Why Save Online News".

Matt Weber started off the discussion by talking about the interactions between web archives and news media, stating that digital-only media has no offline surrogate and that it is becoming increasingly difficult to do anything but look at it now, as it exists. Following Matt Weber were Laura Wrubel and Chris Freeland, who both talked about the large share Twitter has in online news.  Laura Wrubel brought up that in 2011 journalists primarily used Twitter to direct people to articles rather than for conversation. Chris Freeland stated that Twitter was the primary source of information during the Ferguson protests in St. Louis and that the local news outlets were far behind in reporting the organic story as it happened.
Following panel one was Tim Groeling (professor and former chair of the UCLA Department of Communication Studies) giving presentation one entitled "NewsScape: Preserving TV News".

The NewsScape project, led by Tim Groeling, is currently migrating analog recordings of TV news to digital for archiving.  The collection contains recordings dating back to the 1950s and is the largest collection of TV news and public affairs programs, containing a mix of U-matic, Betamax, and VHS tapes.


Currently, the project is working its way through the collection's tapes, completing 36k hours of encoding this year. Tim Groeling pointed out that the VHS tapes, despite being the newest, are the most threatened.
After lunch, the attendees were broken up into fifteen groups for the first of two breakout sessions. Each group was tasked with formulating three things that could be included in a national agenda for news preservation and with coming up with a project to advance the practice of online news preservation.

Each group sent up one person who briefly went over what they had come up with. Despite the diverse backgrounds of the attendees at dtmh2016, the ideas that each group came up with had a lot in common:
  • A list of tools/technologies for archiving (awesome memento)
  • Identifying broken links in news articles
  • Increasing awareness of how much or how little is archived
  • Working with news organizations to increase their involvement in archiving
  • More meetups, events, and hackathons that bring together technologists with journalists and librarians
The final speaker of the day was Clifford Lynch giving a talk entitled "Born-digital news preservation in perspective".
In his talk, Clifford Lynch spoke about problems that plague news preservation such as link rot and the need for multiple archives.

He also spoke on the need to preserve other kinds of media, like data dumps, and noted that archival record keeping goes hand in hand with journalism.
After his talk was over, Edward McCain gave final remarks for day one and transitioned us to the reception for the scholarship winners. The scholarship winners proposed projects (to be completed by December 2016) that would aid in digital news preservation; three of these students are WS-DL members (Shawn Jones, Mat Kelly, and John Berlin).

Day 2 (October 14, 2016)

Day two of Dodging the Memory Hole 2016 began with Sharon Farb welcoming us back.

She was followed by the first presentation of the day, by our very own Dr. Nelson, titled "Summarizing archival collections using storytelling techniques".


The presentation highlighted the work done by Yasmin AlNoamany in her doctoral dissertation, in particular, The Dark and Stormy Archives (DSA) Framework.
Up next was Pulitzer Prize-winning journalist Peter Arnett, who presented "Writing The First Draft of History - and Saving It!", talking about his experiences covering the Vietnam War and how he saved the Associated Press's Saigon office archives.
Following Peter Arnett was the second-to-last panel of dtmh2016, "Kiss your app goodbye: the fragility of data journalism", featuring Ben Welsh, Regina Roberts, and Meredith Broussard, and moderated by Martin Klein.


Meredith Broussard spoke about how archiving of news apps has become difficult as their content does not live in a single place.
Ben Welsh was up next speaking about the work he has done at the LA Times Data Desk.
In his talk, he stressed the need for more tools to be made that allowed people like himself to make archiving and viewing of archived news content easier.
Following Ben Welsh was Regina Roberts, who spoke about the work done at Stanford for archiving and adding context to the data sets that live beside the codebases of research projects.
The last panel of dtmh2016, "The future of the past: modernizing The New York Times archive", featured members of the technology team at The New York Times, Evan Sandhaus, Jane Cotler, and Sophia Van Valkenburg, with moderator Edward McCain.

Evan Sandhaus presented The New York Times' own take on the Wayback Machine, called TimesMachine. TimesMachine allows users to view the microfilm archive of The New York Times.
Sophia Van Valkenburg spoke about how the New York Times was transitioning its news archives into a more modern system.
After Sophia Van Valkenburg was Jane Cotler, who spoke about the gotchas encountered during the migration process. Most notable of the gotchas was that the way in which the articles were viewed (i.e., their visual aesthetics) was not preserved in the migration process, in favor of a "better user experience", and that in migrating to the new system, links to the old pages would no longer work.
Lightning rounds were up next.

Mark Graham of the Internet Archive was up first with a presentation on the Wayback Machine and how, later this year, it would be getting site search.
Jefferson Bailey also of the Internet Archive spoke on the continual efforts at the Internet Archive to get the web archives into the hands of researchers.
Terry Britt spoke about how social media over time establishes "collective memory".
Katherine Boss presented "Challenges facing the preservation of born-digital news applications" and how they end up in dependency hell.
Eva Revear presented a tool to discover frameworks and software used for news apps
Cynthia Joyce talked about a book on Hurricane Katrina and its use of archived news coverage of the storm.
Jennifer Younger presented the work being done by the Catholic News Archive.
Kalev Leetaru talked about the work he and the GDELT Project are doing in web archiving.
The last presentation of the event was by Kate Zwaard, titled "Technology and community: Why we need partners, collaborators, and friends".

Kate Zwaard talked about the success of web archiving events such as the recent Collections as Data and Archives Unleashed 2.0 held at the Library of Congress, the web archive collection at the Library of Congress, how they are putting Jupyter notebooks on top of database dumps, and the diverse skill sets required of today's librarians.
The final breakout sessions of dtmh2016 consisted of four topic discussions.

Jefferson Bailey's session, Web Archiving For News, was an informal breakout in which he asked the attendees about collaboration between the Archive and other organizations. A notable response came from NYTimes representative Evan Sandhaus, with a counter-question about whether news organizations or archives should be responsible for the preservation of news content. Jefferson Bailey responded that he wished organizations were more active in practicing self-archiving. Others described how their own organizations, or ones they knew of, approach self-archiving.

Ben Welsh's session, News Apps, discussed issues archiving news apps, which are online web applications providing rich data experiences. An example app to illustrate this was California's War Dead, which was archived by the Internet Archive but with diminished functionality. In spite of this "success", Ben Welsh brought up the difficulty of preserving the full experience of the app, since web crawlers only interact with client-side code, not the server-side code that is also required. To address this issue, he suggested solutions such as the Python library django-bakery for producing flat, static versions of news apps based on database queries. These static versions can be more easily archived while still providing a fuller experience when replayed. A small sketch of this bake-to-static idea appears below.
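The following is a minimal, hedged sketch of that bake-to-static idea: render database rows into flat HTML files that a crawler can capture and an archive can replay. It deliberately uses only the standard library rather than django-bakery's own API, and the database name, table schema, and output directory are assumptions for illustration.

import sqlite3
from pathlib import Path

OUT_DIR = Path("build")  # directory of static pages for crawlers to capture
OUT_DIR.mkdir(exist_ok=True)

# Hypothetical database standing in for a news app's backend.
conn = sqlite3.connect("wardead.db")
rows = conn.execute("SELECT id, name, hometown FROM casualties").fetchall()

# "Bake" one flat HTML page per record, each with a stable, crawlable URI.
for record_id, name, hometown in rows:
    html = f"<html><body><h1>{name}</h1><p>Hometown: {hometown}</p></body></html>"
    (OUT_DIR / f"casualty-{record_id}.html").write_text(html, encoding="utf-8")

print(f"Baked {len(rows)} static pages into {OUT_DIR}/")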
Eric Weig's session, Working with CMSs, started out with him sharing his experience migrating the CMS of one of the University of Kentucky Libraries Special Collections Research Center newspaper sites from a local data center using sixteen CPUs to a less powerful cloud-based solution using only two CPUs. One of the biggest performance increases came when he switched from dynamically generating pages to serving static HTML pages. Generating the static HTML pages for the eighty-two thousand issues contained in this CMS took only three hours on the two-CPU cloud-based solution. After sharing this experience, the rest of the time was used to hear from the audience about their experiences with CMSs, and an impromptu roundtable discussion on CMSs followed.

Kalev Leetaru's session, The GDELT Project: A Look Inside The World's Largest Initiative To Understand And Archive The World's News, was a more in-depth version of the lightning talk he gave. Kalev Leetaru shared experiences that The GDELT Project had with archival crawling of non-English-language news sites, his work with the Internet Archive on monitoring news feeds and broadcasts, the untapped opportunities for exploration of the Internet Archive, and A Vision Of The Role and Future Of Web Archives. He also shared two questions he is currently pondering: "Why are archives checking certain news organizations more than others?" and "How do we preserve GeoIP-generated content, especially in non-western news sites?".
The last speaker of dtmh2016 was Katherine Skinner with "Alignment and Reciprocity". In her talk, Katherine Skinner called for volunteers to carry out some of the actions mentioned at dtmh2016 and reflected on the past two days.
Closing out dtmh2016 were Edward McCain, who thanked everyone for coming and expressed how enjoyable the event was, especially with the graduate students in attendance, and Todd Grappone with his closing remarks. In the closing remarks, Todd Grappone reminded attendees of the pressing problems in news archiving and how they require both academic and software solutions.
Video recordings of DTMH2016 can be found on the Reynolds Journalism Institute's Facebook page. Chris Aldrich recorded audio along with a transcription of days one and two. NPR's Research, Archive & Data Strategy team created a Storify page of tweets covering topics they found interesting.

-- John Berlin 

Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?


"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)
The first presidential debate (H. Clinton v. D. Trump) took place on Monday, September 26, 2016, at Hofstra University, New York. The questions were about topics like the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York). This reminded me of the work we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington, DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 presidential election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites are maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or by newspapers like www.washingtonpost.com and www.nytimes.com. They were crawled on the days around election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files:

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., Internet Archive's Heritrix writes what it finds on the Web into ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Apache Spark Scala scripts to iterate over all ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content, as shown below (we hid the content of the web pages due to copyright issues). Results were collected into a single TSV file. A simplified sketch of this extraction step, outside of Warcbase, follows.
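For readers without a Spark/Warcbase setup, here is a simplified Python sketch of the same extraction step using the warcio library, which can read ARC files when asked to convert them to WARC-style records. This is an alternative to the Scala scripts we used, not a reproduction of them; the input file name and the crude tag-stripping regex are assumptions.

import re
import sys
from urllib.parse import urlsplit
from warcio.archiveiterator import ArchiveIterator

TAG_RE = re.compile(r"<[^>]+>")  # crude HTML tag stripper, for illustration only

with open("example.arc.gz", "rb") as stream:  # hypothetical ARC file name
    for i, record in enumerate(ArchiveIterator(stream, arc2warc=True)):
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        date = record.rec_headers.get_header("WARC-Date")
        body = record.content_stream().read().decode("utf-8", errors="replace")
        text = " ".join(TAG_RE.sub(" ", body).split())  # strip tags, collapse whitespace
        # One TSV row per page: ID, crawl date, domain, URI, (truncated) text
        sys.stdout.write(f"{i}\t{date}\t{urlsplit(uri).netloc}\t{uri}\t{text[:200]}\n")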

[2] Extract named entities and topics

We used the Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling we used the following techniques:
After applying the above techniques, the results were aggregated in a text file that is used as input to the visualization tool (described in step [3]). A part of the results is shown in the table below, and a simplified sketch of the state-counting step follows the table.

State | Candidate | Frequency of mentioning the state | The most important topic
Mississippi | Kerry | 85 | Iraq
Mississippi | Bush | 131 | Energy
Oklahoma | Kerry | 65 | Jobs
Oklahoma | Bush | 85 | Retirement
Delaware | Kerry | 53 | Colleges
Delaware | Bush | 2 | Other
Minnesota | Kerry | 155 | Jobs
Minnesota | Bush | 303 | Colleges
Illinois | Kerry | 86 | Iraq
Illinois | Bush | 131 | Health
Georgia | Kerry | 101 | Energy
Georgia | Bush | 388 | Tax
Arkansas | Kerry | 66 | Iraq
Arkansas | Bush | 42 | Colleges
New Mexico | Kerry | 157 | Jobs
New Mexico | Bush | 384 | Tax
Indiana | Kerry | 132 | Tax
Indiana | Bush | 43 | Colleges
Maryland | Kerry | 94 | Jobs
Maryland | Bush | 213 | Energy
Louisiana | Kerry | 60 | Iraq
Louisiana | Bush | 262 | Tax
Texas | Kerry | 195 | Terrorism
Texas | Bush | 1108 | Tax
Tennessee | Kerry | 69 | Tax
Tennessee | Bush | 134 | Teacher
Arizona | Kerry | 77 | Iraq
Arizona | Bush | 369 | Jobs
...
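Here is a simplified sketch of how counts like those in the table can be derived from the TSV produced in step [1]. It sidesteps the full Stanford NER and topic-modeling pipeline and simply counts state-name matches in each page's text, attributing a page to a candidate by its source domain; the input file name, the domain-to-candidate mapping, and the abbreviated state list are assumptions for illustration.

import csv
from collections import Counter

CANDIDATE_BY_DOMAIN = {"www.georgewbush.com": "Bush", "www.johnkerry.com": "Kerry"}
STATES = ["Mississippi", "Oklahoma", "Delaware", "Minnesota", "Illinois", "Georgia",
          "Arkansas", "New Mexico", "Indiana", "Maryland", "Louisiana", "Texas",
          "Tennessee", "Arizona"]  # abbreviated list

counts = Counter()
with open("pages.tsv", encoding="utf-8", newline="") as f:  # TSV from step [1]
    for page_id, crawl_date, domain, uri, text in csv.reader(f, delimiter="\t"):
        candidate = CANDIDATE_BY_DOMAIN.get(domain)
        if candidate is None:
            continue
        for state in STATES:
            counts[(state, candidate)] += text.count(state)

for (state, candidate), n in counts.most_common(10):
    print(f"{state}\t{candidate}\t{n}")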

[3]  Interactive US map 

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic), while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, such as the winning party for each state. In addition, we inserted locations (latitude and longitude) to place the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. If you are interested in interacting with the map, visit http://www.cs.odu.edu/~maturban/hackathon/.


Looking at the map might help us answer the research questions, but it might also raise other questions, such as why the Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because they are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, it might be useful to revisit this topic now, as we are close to the 2016 presidential election (H. Clinton v. D. Trump); the same analysis could be applied again to see what newspapers say about this election.


--Mohamed Aturban