Web Science and Digital Libraries Research Group: Web Archiving


Saturday, November 5, 2016

2016-11-05: Pro-Gaddafi Digital Newspapers Disappeared from the Live Web!

Internet Archive & Libyan newspapers logos

Colonel Gaddafi ruled Libya for 42 years after taking power from King Idris in a 1969 military coup. In August 2011, his regime was toppled in the so-called Arab Spring. For more than four decades, media in Libya was highly politicized to support Gaddafi’s regime and secure his power. After the Libyan revolution (in 2011), the media were freed from the tight control of the government, and we have seen the establishment of tens if not hundreds of new media organizations. Here is an overview of one part of Gaddafi’s propaganda machine, its newspapers:

71 newspapers and magazines
All monitored and published by the Libyan General Press Corporation (LGPC)
The Jamahiriya News Agency (JANA) was the main source of domestic news
No real political function other than to polish the regime’s image
Publish information provided by the regime

The following are the most well-known Libyan newspapers, all of which were published by LGPC:

All Libyan newspaper websites are no longer controlled by the government

After the revolution, most of the Libyan newspapers' websites, including the website of the Libyan General Press Corporation (LGPC), came under the control of foreign institutions, in particular an Egyptian company. Al Jamahiriya (www.aljamahiria.com/), El shams (alshames.com), and El Fajr El Jadid (www.alfajraljadeed.com/) became Egyptian news websites under different names: Jime News (www.news.aljamahiria.com/), Kifah Arabi (www.news.kifaharabi.com/), and El Fajr El Jadid Alakbaria, while El Zahf Al Akhdar (www.azzahfalakhder.com/) is now a German sport blog. Here are the logos of the new websites (the new websites kept the same domain names, except alshames.com, which redirects to www.news.kifaharabi.com/):

Can we still have access to the old state media?

After this big change in Libya with the fall of the regime, can we still have access to the old state media? (This question might apply to other countries as well. Would any political or regime change in a country lead to the loss of part of its digital history?)

Fortunately, Internet Archive has captured thousands of snapshots of the Libyan newspapers' websites. The main pages of Al Jamahiriya (www.aljamahiria.com/), El shams (alshames.com), El Zahf Al Akhdar (www.azzahfalakhder.com/), and El Fajr El Jadid (www.alfajraljadeed.com/) have been captured 2310, 606, 1398, and 836 times, respectively, by the Internet Archive.
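
Capture counts like these can be approximated by listing the captures of a URL and counting them. The sketch below assumes the Wayback Machine's CDX server endpoint and its `url` and `fl` query parameters. The sample response text is made up for illustration, and the helper names are hypothetical. Counts obtained this way may not match the numbers shown on the Wayback Machine calendar page.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url):
    """Build a CDX query URL that lists timestamp and status for each capture."""
    query = urlencode({"url": page_url, "fl": "timestamp,statuscode"})
    return f"{CDX_ENDPOINT}?{query}"


def count_captures(cdx_text):
    """Count captures in a plain-text CDX response (one capture per non-empty line)."""
    return sum(1 for line in cdx_text.splitlines() if line.strip())


def captures_per_year(cdx_text):
    """Group captures by the year at the start of each 14-digit timestamp."""
    counts = {}
    for line in cdx_text.splitlines():
        if not line.strip():
            continue
        year = line.split()[0][:4]
        counts[year] = counts.get(year, 0) + 1
    return counts


# Illustrative response body only; these are not real capture records.
SAMPLE_RESPONSE = "20110514103049 200\n20110601000000 200\n\n20120101000000 404\n"

if __name__ == "__main__":
    print(cdx_query_url("www.aljamahiria.com/"))
    print(count_captures(SAMPLE_RESPONSE), captures_per_year(SAMPLE_RESPONSE))
```

Fetching the URL returned by `cdx_query_url` and passing the response body to `count_captures` would give a count for a single page; the network call is left out so the sketch runs offline.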

www.aljamahiria.com/ captured 2,310 times by the Internet Archive

www.azzahfalakhder.com/ captured 1,398 times by the Internet Archive

Praise for Qaddafi no longer on the live web

Although we cannot conclude that the Internet Archive has captured everything, because the content in these newspapers was extremely redundant as they focused on praising the regime, the Internet Archive has captured important events, such as the regime's activities during the "2011" revolution, a lot of domestic news and the regime's interpretation of international news, many economic articles, the long process taken by Libyan authorities in order to establish the African Union, Gaddafi's speeches, etc. Below is an example of one of these articles from the Libyan "2011" revolution, stating that "there will be no future for Libya without our leader Gaddafi". This article is no longer available on the live web.

From the Internet Archive https://web.archive.org/web/20110514103049/http://www.alfajraljadeed.com//full.pdf

From the live web http://www.alfajraljadeed.com//full.pdf

Slides about this post are also available:

--Mohamed Aturban

Wednesday, October 26, 2016

2016-10-26: They should not be forgotten!

Source: http://www.masrawy.com/News/News_Various/details/2015/6/7/596077/أسرة-الشهيد-أحمد-بسيوني-فوجئنا-بصورته-على-قناة-الشرق-والقناة-نرفض-التصريح

I remember his face and smile very well. It was very hard for me to look at his smile and realize that he would not be in this world again. It got worse when I read his story and those of many others who died defending the future of my home country, Egypt, hoping to build a better future for their kids. Ahmed Basiony, one of Egypt’s great artists, was killed by the Egyptian regime on January 28, 2011. One of the main reasons that drove Basiony to participate in the protests was to film police beatings and document the protests. While he was filming, he also used his camera to zoom in on the soldiers and warn the people around him so that they could take precautions before the gunfire began. Suddenly, his camera fell down.

Basiony was the father of two kids, aged one and six. He was loved by everyone who knew him. I hope Basiony's and others' stories will remain for future generations.

Basiony was among the protesters in the first days of the Egyptian Revolution.
Source: https://www.facebook.com/photo.php?fbid=206347302708907&set=a.139725092704462.24594.100000009164407&type=3&theater

curl -I http://1000memories.com/egypt
HTTP/1.1 404 Not Found
Date: Tue, 25 Oct 2016 16:53:04 GMT
Server: nginx/1.4.6 (Ubuntu)
Content-Type: text/html; charset=UTF-8

Information about Basiony and many other martyrs was documented at the site 1000memories.com/egypt. The 1000memories site contained a digital collection of around 403 martyrs with information about their lives. The entire Web site is unavailable now, and the Internet Archive is the only place where it was archived. 1000memories is not the only site that has disappeared; many other repositories that contained videos, images, etc. documenting the 18 days of the Egyptian Revolution have disappeared as well. Examples are iamtahrir.com (archived version), which contained the artwork produced during the Egyptian Revolution, and 25Leaks.com (archived versions), which contained hundreds of important papers posted by people during the revolution. Both sites were created for collecting content related to the Egyptian Revolution.
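
When a live page returns a 404 like the one above, a reader can be pointed at an archived copy instead. The following is a minimal sketch of that fallback, assuming the Wayback Machine URL pattern visible elsewhere on this page (a 14-digit timestamp followed by the original URI). The function names are hypothetical, and the date passed for 1000memories is only a placeholder, not the date of a known capture.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(uri, when):
    """Build a Wayback Machine URL asking for a capture of `uri` near `when`."""
    return f"{WAYBACK_PREFIX}{when.strftime('%Y%m%d%H%M%S')}/{uri}"


def pick_uri(status_code, uri, when):
    """Return the live URI unless it is gone (404 or 410); then return an archive URL."""
    if status_code in (404, 410):
        return wayback_url(uri, when)
    return uri


if __name__ == "__main__":
    # Placeholder date: any moment during the period of interest would do.
    print(pick_uri(404, "http://1000memories.com/egypt", datetime(2011, 6, 1)))
```

The status code is passed in rather than fetched so that the sketch runs without network access; in practice it would come from a HEAD request like the curl call above.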

An archived copy of 1000memories in Archive-It.

The Jan. 25 Egyptian Revolution is one of the most important events that has happened in recent history. Several books and initiatives have been published for documenting the 18 days of the Egyptian Revolution. These books cited many digital collections and other sites that were dedicated to document the Egyptian Revolution (e.g., 25Leaks.com). Unfortunately, the links to many of these Web sites are now broken and there is no way (without the archive) to know what they contained.

Luckily, 1000memories.com/egypt has multiple copies in the "Egypt Revolution and Politics" collection in Archive-It, a subscription service from the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. I'm glad I found information about Basiony and many more martyrs archived!

Archiving Web pages is a method for ensuring these resources are available for posterity. My PhD research focused on exploring methods for summarizing and interacting with collections in Archive-It, and recording the events of the Egyptian Revolution spurred my initial interest in web archiving. My research necessarily focused on quantitative analysis, but this post has allowed me to revisit the humanity behind these web pages that would be lost without web archiving.

Sources:

--Yasmin

Tuesday, October 25, 2016

2016-10-25: Web Archive Study Informs Website Design

Shortly after beginning my Ph.D. research with the Old Dominion University Web Science and Digital Libraries team, I also rediscovered a Hampton Roads folk music non-profit I had spent a lot of time with years before. Somehow I was talked into joining the board (not necessarily the most sensible thing when pursuing a Ph.D.).

My research area being digital preservation and web archiving, I decided to have a look at the Tidewater Friends of Folk Music (TFFM) website and its archived web pages (mementos). Naturally, I looked at the oldest copy of the home page available, 2002-01-25. What I found is definitely reminiscent of early, mostly hand-coded HTML:

tffm.org 2002-01-25 23:57:26 GMT (Internet Archive)
https://web.archive.org/web/20020125235726/http://tffm.org/

Of course the most important thing for most people is concerts, so I had a look at the concerts page too (interestingly, the newest concerts page available is five years newer than the oldest home page; this phenomenon was the subject of my JCDL 2013 paper).

tffm.org/concerts 2007-10-07 06:17:32 GMT (Internet Archive)
https://web.archive.org/web/20071007061732/http://tffm.org/concerts.html

Clicking my way through the home and concerts page mementos, I found little had changed over time other than the masthead image.


2005-08-26 21:05:28 GMT	2005-12-11 09:23:55 GMT	2009-08-31 06:31:40 GMT

The end result is that I became, and remain, TFFM’s web master. However, studying web archive quality, that is completeness and temporal coherence, has greatly influenced my redesigns of the TFFM website. First up was bringing the most important information to the forefront in a much more readable and navigable format. Here is a memento captured 2011-05-23:

tffm.org 2011-05-23 11:10:54 GMT (Internet Archive)
https://web.archive.org/web/20110523111054/http://www.tffm.org/concerts.html

As part of the redesign, I put my new-found knowledge of archival crawlers to use. The TFFM website now had a proper sitemap, and every concert had its own URI with very few URI aliases. This design lasted until the TFFM board decided to replace “Folk” with “Acoustic,” changing the name to Tidewater Friends of Acoustic Music (TFAM).
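
To illustrate what a sitemap with one URI per concert might look like, here is a minimal sketch that generates one. It is not the code used for the TFFM site: the concert URIs are invented for the example, and the only external fact assumed is the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(uris):
    """Return sitemap XML listing each URI once, in first-seen order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # dict.fromkeys drops exact duplicates while preserving order.
    for uri in dict.fromkeys(uris):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical concert URIs, for illustration only.
EXAMPLE_URIS = [
    "http://tffm.org/concerts/2011-06-04.html",
    "http://tffm.org/concerts/2011-07-09.html",
    "http://tffm.org/concerts/2011-06-04.html",  # repeated entry, listed once
]

if __name__ == "__main__":
    print(build_sitemap(EXAMPLE_URIS))
```

Listing each page under a single URI gives a crawler one name per resource, which is the point of avoiding URI aliases.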

Along with the change came a brighter look and mobile-friendly design. Again, putting knowledge from my Ph.D. studies to work, the mobile-friendly design is responsive, adapting to the user’s device, rather than incorporating a second set of URIs and independent design. With the responsive approach, archived copies replay correctly in both mobile and desktop browsers.

tidewateracoustic.org 2014-10-07 01:56:07 GMT
https://web.archive.org/web/20141007015607/http://tidewateracoustic.org/

After watching several fellow Ph.D. students struggle with the impact of JavaScript and dynamic HTML on archivability, I elected to minimize the use of JavaScript on the TFAM site. JavaScript greatly complicates web archiving and reduces archive quality significantly.

So, the sensibility of taking on a volunteer website project while pursuing my Ph.D. aside, I can say that in some ways the two have synergy. My Ph.D. studies have influenced the design of the TFAM website and the TFAM website is a small, practical, and personal proving ground for my Ph.D. work. The two have complemented each other well.

Enjoy live music? Check out http://tidewateracoustic.org!

— Scott G. Ainsworth

2016-10-26: A look back at the 2008 and 2012 US General Elections via Web Archives

Web Archives perform the crucial service of preserving our collective digital heritage. October 26, 2016 marks the 20th anniversary of the Internet Archive, and the United States presidential election will take place November 8, 2016. To commemorate both occasions, let us look at the 2008 and 2012 US General Elections as told by Web Archives from the perspectives of CNN and Fox News. We started with three news media (MSNBC, CNN, and Fox News) in order to capture both ends of the political spectrum. However, msnbc.com has redirected to various different URLs in the past (e.g., msnbc.msn.com, nbcnews.com), and the result is that the site is not well-archived.

Obama vs McCain - Fox News (2008)

Obama vs McCain - CNN (2008)

Obama vs Romney - Fox News (2012)

The archives show that the current concerns about voter fraud and election irregularities are not new (at least on Fox News; we did not find corresponding stories at CNN).

This Fox News page contains a story titled: "Government on High Alert for Voter Fraud" (2008)

Fox News: "Trouble at the ballot box" (2008)

Fox News claims a mural of Obama at a Philly polling station, that was ordered to be covered by a Judge, was not properly covered (2012)

Fox News reports about Election day Monitors on the lookout for voter fraud and "funny business" (2012)

Obama vs Romney - CNN (2012)

We appreciate the ability to tell these stories by virtue of the presence of public Web archives such as the Internet Archive. We also appreciate frameworks such as the Memento protocol that provide a means to access multiple web archives, and tools such as Sawood's MemGator, which implements the Memento protocol. For the comprehensive list of mementos (extracted with MemGator) for these stories see: Table vis or Timeline vis.
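
A list of mementos in the Memento protocol is typically delivered as a TimeMap in link format, where each entry carries a URI, a `rel` value, and a `datetime`. The sketch below shows one way to pull the mementos out of such a TimeMap. It is not the code used for the visualizations mentioned above; the sample TimeMap uses placeholder example.com and archive.example URIs, and the regular expressions only cover the simple quoted-attribute form shown here.

```python
import re
from email.utils import parsedate_to_datetime

# One entry: <URI> followed by zero or more ;name="value" attributes.
ENTRY = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM = re.compile(r'([\w-]+)="([^"]*)"')


def parse_mementos(timemap_text):
    """Return (datetime, URI) pairs for every memento entry, oldest first."""
    mementos = []
    for uri, raw_params in ENTRY.findall(timemap_text):
        params = dict(PARAM.findall(raw_params))
        # rel may hold several values, e.g. "first memento".
        if "memento" in params.get("rel", "").split() and "datetime" in params:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return sorted(mementos)


# Placeholder TimeMap for illustration; not real archive data.
SAMPLE_TIMEMAP = '''<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/web/20121106000000/http://example.com/>; rel="last memento"; datetime="Tue, 06 Nov 2012 00:00:00 GMT",
<http://archive.example/web/20081104000000/http://example.com/>; rel="first memento"; datetime="Tue, 04 Nov 2008 00:00:00 GMT"
'''

if __name__ == "__main__":
    for when, uri in parse_mementos(SAMPLE_TIMEMAP):
        print(when.strftime("%Y-%m-%d"), uri)
```

Sorting by datetime makes it straightforward to find the captures closest to an event such as an election day.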

--Nwala

2016-10-25: Paper in the Archive

Mat reports on his journalistic experience and how we can relive it through Internet Archive (#IA20)

We have our collections, the things we care about, the mementos that remind us of our past. Many of these things reside on the Web. For those we want to recall and should have (in hindsight) saved, we turn to the Internet Archive.

As a computer science (CS) undergrad at University of Florida, I worked at the student-run university newspaper, The Independent Florida Alligator. This experience became particularly relevant with my recent scholarship to preserve online news. At the paper, we reported mostly on the university community, but also on news that catered to the ACRs through reports about Gainesville (e.g., city politics).

News is compiled late in the day to maximize temporal currency. I started at the paper as a "Section Producer" and eventually evolved to be a Managing Editor. I was in charge of the online edition, the "New Media" counterpart of the daily print edition -- Alligator Online. The late shift fit well with my already established coding schedule.

Proof from '05, with the 'thew' still intact.

The Alligator is an independent newspaper -- the content we published could conflict with the university without fear of being censored by the university. Typical associated college newspapers have this conflict of interest, which potentially limits their content only to that which is approved. This was part of the draw to the paper for me and I imagine, the student readers seeking less biased reporting. The orange boxes were often empty well before day's end. Students and ACRs read the print paper. As a CS student, I preferred Alligator Online.

With a unique technical perspective among my journalistic peers, I introduced a homebrewed content management system (CMS) into the online production process. This allowed Alligator Online to focus on porting the print content and not on futzing with markup. This also made the content far more accessible and, as time has shown thanks to Internet Archive, preservable.

Internet Archive's capture of Alligator Online at alligator.org over time with my time there highlighted in orange.

After graduating from UF in 2006, I continued to live and work elsewhere in Gainesville for a few years. Even then technically an ACR, I still preferred Alligator Online to print. A new set of students transitioned into production of Alligator Online and eventually deployed a new CMS.

Now as a PhD student of CS studying the past Web, I have observed a resultant decline in accessibility that occurred after I had moved on from the paper. This corresponds further with our work On the Change in Archivability of Websites Over Time (PDF). Thankfully, adaptations at Alligator Online and possibly IA have allowed the preservation rate to recover (see above, post-tenure).

alligator.org before (2004) and after (2006) I managed, per captures by Internet Archive.

With Internet Archive celebrating 20 years in existence (#IA20), IA has provided the means for me to see the aforementioned trend in time. My knowledge in the mid-2000s of web standards and accessibility facilitated preservation. Because of this, with special thanks to IA, the collections of pages I care about -- the mementos that remind me of my past -- are accessible and well-preserved.

— Mat (@machawk1)

NOTE: Only after publishing this post did I think to check alligator.org's robots.txt file as archived by IA. The final capture of alligator.org in 2007, before the next temporally adjacent one in 2009, occurred on August 7, 2007. At that time (and prior), no robots.txt file existed for alligator.org, although IA preserved the 404 response for it. Around late October of that same year, a robots.txt file was introduced with the lines:
User-Agent: *
Disallow: /
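
The effect of those two lines can be checked with Python's standard `urllib.robotparser`. This is a small sketch under two assumptions: that the crawler honors robots.txt, and that it identifies itself as `ia_archiver`, which is used here only as an example user agent name.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above.
BLOCK_ALL = "User-Agent: *\nDisallow: /\n"


def is_crawl_allowed(robots_txt, user_agent, url):
    """Report whether robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Example user agent name; an assumption, not taken from the post.
    print(is_crawl_allowed(BLOCK_ALL, "ia_archiver", "http://www.alligator.org/"))
```

A wildcard user agent combined with `Disallow: /` excludes every compliant crawler from every path on the site.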

Saturday, October 22, 2016

2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

Dodging the Memory Hole 2016, held at UCLA's Charles Young Research Library in Los Angeles, California, was a two-day event to discuss and highlight potential solutions to the issue of preserving born-digital news. Organized by Edward McCain (digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries), this event brought together technologists, archivists, librarians, journalists, and fourteen graduate students who had won travel scholarships for attendance. Among the attendees were four members of the WS-DL group (l-r): Mat Kelly, John Berlin, Dr. Michael Nelson, and Shawn Jones.

The event was made possible by support from the Reynolds Journalism Institute, Journalism Digital News Archive (JDNA), UCLA Library, the Educopia Institute and the Institute of Museum and Library Services (IMLS).

Day 1 (October 13, 2016)

Day one started off at 9am with Edward McCain welcoming everyone to the event and then turning it over to Ginny Steel, UCLA University Librarian, for opening remarks.

@RJIJDNA @UCLA @vsteel Saving Online #News #Legal #Technical #Policy #dtmh2016 #digitalmemory #freeexpression #historicalrecord #infoaccess pic.twitter.com/D5WFalhrWR
- Sharon E. Farb (@FarbThink) October 13, 2016

In the opening remarks, Ginny reflected on her career as a lifelong librarian and on the evolution of printed news to digital, and in closing she summarized the role archiving has to play in the born-digital news era.

The challenge to #dtmh2016 is to develop a framework to preserve the news. @vsteel
- Todd Grappone (@liber8er) October 13, 2016

After opening remarks, Edward McCain went over the goals and sponsors of the event before transitioning to the first speaker Hjalmar Gislason.

Hjalmar Gislason's talk was entitled "Digital Salvage Operations: What's worth saving?"

#DtMH2016 @hjalli: The fundamental question: Do you want to save everything or do you want to get rid of everything?
- ChrisAldrich (@ChrisAldrich) October 13, 2016

In the talk, Hjalmar touched on issues concerning the amount of data currently being generated, how to determine the context and importance of data, and how data lost because no one knew whether it was important could mean losing someone's life's work. Hjalmar ended his talk with two takeaway points: "There is more to news archiving than the web: there is mobile content" and "Television news is also content that is important to save".

@hjalli Keynote #dtmh2016 #Digital #Salvage What #News Worth Saving? Not enough to save #stories. #Context is Everything #authenticity #1984 pic.twitter.com/vH82BZeD4F
- Sharon E. Farb (@FarbThink) October 13, 2016

After a short break, panel one, which consisted of Chris Freeland, Matt Weber, Laura Wrubel, and moderator Ana Krahmer, addressed the question of "Why Save Online News".

Next Up. Why #Save #Online #News? Challenge #Access #Online #News Post Event @chrisfreeland @liblaura @docmattweber #dtmh2016 #localnews pic.twitter.com/KTpg7z4NwB
- Sharon E. Farb (@FarbThink) October 13, 2016

Matt Weber started off the discussion by talking about the interactions between web archives and news media, stating that digital-only media has no offline surrogate and that it is becoming increasingly difficult to do anything but look at it as it exists now. Following Matt Weber were Laura Wrubel and Chris Freeland, who both talked about the large share Twitter has in online news. Laura Wrubel brought up that in 2011 journalists primarily used Twitter to direct people to articles rather than for conversation. Chris Freeland stated that Twitter was the primary source of information during the Ferguson protests in St. Louis and that the local news outlets were far behind in reporting the organic story as it happened.

#dtmh2016 @docmattweber "I don't think you'll ever convince publishers at scale to donate their economic property to memory institutions."
- Kate Zwaard (@kzwa) October 13, 2016

Following panel one was Tim Groeling (professor and former chair of the UCLA Department of Communication Studies) giving presentation one entitled "NewsScape: Preserving TV News".

The NewsScape project, led by Tim Groeling, is currently migrating analog recordings of TV news to digital for archival. The collection contains recordings dating back to the 1950s and is the largest collection of TV news and public affairs programs, containing a mix of U-matic, Betamax, and VHS tapes.

Currently, the project is working its way through the collection's tapes, completing 36k hours of encoding this year. Tim Groeling pointed out that VHS tapes, despite being the newest, are the most threatened.

#DtMH2016 Tim Groeling: We use a layer of dead VCR's over our good VCR's to prevent RF interference and audio buzzing. :)
- ChrisAldrich (@ChrisAldrich) October 13, 2016

After lunch, the attendees were broken up into fifteen groups for the first of two breakout sessions. Each group was tasked with formulating three things that could be included in a national agenda for news preservation and to come up with a project to advance the practice of online news preservation.

Each group sent up one person who briefly went over what they had come up with. Despite the diverse background of the attendees at dtmh2016 the ideas that each group came up with had a lot in common:

A list of tools/technologies for archiving (awesome memento)
Identifying broken links in news articles
Increase awareness of how much or how little is archived
Work with news organization to increase their involvement in archiving
More meetups, events, hackathons that bring together technologists with journalists and librarians

The final speaker of the day was Clifford Lynch giving a talk entitled "Born-digital news preservation in perspective".

Dr. Clifford Lynch #dtmh2016 speaking about problems of scholarly journals, a topic very important to Phittle pic.twitter.com/ZWE6SFEXir
— Phittle (@ThePhittle) October 13, 2016

In his talk, Clifford Lynch spoke about problems that plague news preservation such as link rot and the need for multiple archives.

#DtMH2016 Clifford Lynch: The material on lots of links (as sources) disappears after a short period of time.
— ChrisAldrich (@ChrisAldrich) October 13, 2016

Clifford Lynch of @cni_org at #dtmh2016: "We have this mythology that @internetarchive archives the web. ... It's not a total solution."
— Ben Welsh (@palewire) October 13, 2016

He also spoke on the need to preserve other kinds of media like data dumps and that archival record keeping goes hand in hand with journalism.

Who preserves the data dumps? Who preserves the PDFs and reports? No one has really stepped up. -Clifford Lynch #dtmh2016
— P. Kim Bui (@kimbui) October 13, 2016

"Responsible journalism implies a strong permanent record of that work." -- Clifford Lynch #dtmh2016
— Kate Zwaard (@kzwa) October 13, 2016

After his talk was over, Edward McCain gave final remarks for day one and transitioned us to a reception for the scholarship winners. The scholarship winners proposed projects (to be completed by December 2016) that would aid in digital news preservation, and three of these students were WS-DL members (Shawn Jones, Mat Kelly, John Berlin).

#dtmh2016 Introducing Amazing #GradStudents #Scholars #UCLA #Saving #Online #News #Students=Future pic.twitter.com/WGKFNrHs3O
- Sharon E. Farb (@FarbThink) October 13, 2016

Day 2 (October 14, 2016)

Day two of Dodging the Memory Hole 2016 began with Sharon Farb welcoming us back.

@FarbThink greeting #dtmh2016 participants on start of day 2 @UCLA_library #savenews pic.twitter.com/2u353dLi9R
— Edward McCain (@e_mccain) October 14, 2016

@FarbThink talks about human rights and the role journalism and journalists play. It's critical that we preserve that work #dtmh2016
— Todd Grappone (@liber8er) October 14, 2016

This was followed by the first presentation of the day, by our very own Dr. Nelson, titled "Summarizing archival collections using storytelling techniques".

#dtmh2016 @phonedude_mln presents work with @yasmina_anwar and @weiglemc on "summarizing archival collections using storytelling techniques" pic.twitter.com/pW38yRrYs0
— Shawn M. Jones (@shawnmjones) October 14, 2016

The presentation highlighted the work done by Yasmin AlNoamany in her doctoral dissertation, in particular, The Dark and Stormy Archives (DSA) Framework.

#dtmh2016 @phonedude_mln details the Dark and Stormy Archives (DSA) framework for storytelling with archives pic.twitter.com/cpF6pBB2kQ
— Shawn M. Jones (@shawnmjones) October 14, 2016

Up next was Pulitzer Prize-winning journalist Peter Arnett, who presented "Writing The First Draft of History - and Saving It!", talking about his experiences while covering the Vietnam War and how he saved the Associated Press's Saigon office archives.

Peter Arnett talks about being a journalist covering the Vietnam War and censorship #dtmh2016 pic.twitter.com/GgKL3NOMs4
— Todd Grappone (@liber8er) October 14, 2016

Following Peter Arnett was the second-to-last panel of dtmh2016, "Kiss your app goodbye: the fragility of data journalism", featuring Ben Welsh, Regina Roberts, and Meredith Broussard, and moderated by Martin Klein.

Meredith Broussard spoke about how archiving of news apps has become difficult as their content does not live in a single place.

#DtMH2016 @merbroussard: News apps don't live in any of the CMSs. They're bespoke and live on a separate data server.
— ChrisAldrich (@ChrisAldrich) October 14, 2016

This is even more complicated with news apps, which are dynamic and separate from the web CMS @merbroussard #dtmh2016
— Kate Zwaard (@kzwa) October 14, 2016

Ben Welsh was up next speaking about the work he has done at the LA Times Data Desk.

Ben Welsh @palewire presenting on news apps at #dtmh2016 @UCLA_library pic.twitter.com/6MVECkJmOl
— Edward McCain (@e_mccain) October 14, 2016

In his talk, he stressed the need for more tools that make archiving, and viewing archived news content, easier for people like himself.

.@palewire Made a Django Momento plugin https://t.co/xlsSO4GJjQ #dtmh2016
— Kate Zwaard (@kzwa) October 14, 2016

Following Ben Welsh was Regina Roberts, who spoke about the work done at Stanford for archiving and adding context to the data sets that live beside the codebases of research projects.

#dtmh2016 Regina Lee Roberts on preservation and sharing of big data at Stanford pic.twitter.com/1aVaMiX2mR
— Shawn M. Jones (@shawnmjones) October 14, 2016

#dtmh2016 Regina Lee Roberts talks about creating BLDR (big local data repository) at Stanford pic.twitter.com/X9VQO8yRhC
— Shawn M. Jones (@shawnmjones) October 14, 2016

The last panel of dtmh2016, "The future of the past: modernizing The New York Times archive", featured members of the technology team at the New York Times (Evan Sandhaus, Jane Cotler, and Sophia Van Valkenburg) with moderator Edward McCain.

Evan Sandhaus presented the New York Times' own take on the Wayback Machine, called TimesMachine. The TimesMachine allows users to view the microfilm archive of The New York Times.

#dtmh2016 @kansandhaus introduces the @nytimes TimesMachine of scans and metadata from microfilm https://t.co/ExRZGkdqfm pic.twitter.com/x3bpvZ0Yut
— Shawn M. Jones (@shawnmjones) October 14, 2016

Sophia Van Valkenburg spoke about how the New York Times was transitioning its news archives into a more modern system.

#dtmh2016 Sophia van Valkenburg demonstrates flowchart for converting legacy born digital articles @nytimes into format used by current CMS pic.twitter.com/XqlHAFGBwa
— Shawn M. Jones (@shawnmjones) October 14, 2016

After Sophia Van Valkenburg was Jane Cotler, who spoke about the gotchas encountered during the migration process. The most notable of the gotchas were that the way in which the articles were viewed (i.e., visual aesthetics) was not preserved in the migration process, in favor of a "better user experience", and that in migrating to the new system, links to the old pages would no longer work.

#dtmh2016 Jane Cotler mentioned decommissioning old URLs and how this can lead to link rot for those linking to @nytimes
— Shawn M. Jones (@shawnmjones) October 14, 2016

#DtMH2016 @janecotler: We made the decision of taking out data we had in lieu of making a better user experience for missing sections.
— ChrisAldrich (@ChrisAldrich) October 14, 2016

#dtmh2016 @kansandhaus "much easier to preserve print journalism because it is not a nexus of content and software"
— Shawn M. Jones (@shawnmjones) October 14, 2016

Lightning rounds were up next.

Mark Graham of the Internet Archive was up first with a presentation on the Wayback Machine and how later this year it would be getting site search.

@MarkGraham of @internetarchive lightning talk @waybackmachine #dtmh2016 pic.twitter.com/z1ajvUuBvP
— John Berlin (@johnaberlin) October 14, 2016

#dtmh2016 @MarkGraham from @internetarchive discussed "save page now" @internetarchive, upcoming site search, and more
— Shawn M. Jones (@shawnmjones) October 14, 2016

Jefferson Bailey also of the Internet Archive spoke on the continual efforts at the Internet Archive to get the web archives into the hands of researchers.

#dtmh2016 @jefferson_bail on "trying to get web archives into the hands of researchers" pic.twitter.com/TKu9s6MU8b
— Shawn M. Jones (@shawnmjones) October 14, 2016

#dtmh2016 @jefferson_bail is talking about derivative data sets for researchers, extracting metadata from collections into WAT, LGA, WANE
— Shawn M. Jones (@shawnmjones) October 14, 2016

Terry Britt spoke about how social media over time establishes "collective memory".

On episodic and mediated memory. Journalists are responsible for mediated memory. #dtmh2016 pic.twitter.com/hLQH1EFBAg
— P. Kim Bui (@kimbui) October 14, 2016

Katherine Boss presented "Challenges facing the preservation of born-digital news applications" and how they end up in dependency hell.

Lightning talk from Katherine Boss and Meredith Broussard#dtmh2016 pic.twitter.com/EbKxqrfE8j
— John Berlin (@johnaberlin) October 14, 2016

Eva Revear presented a tool to discover frameworks and software used for news apps.

#dtmh2016 @erevear presents a survey tool to discover frameworks and software used for news apps pic.twitter.com/LdIdtpnO8q
— Shawn M. Jones (@shawnmjones) October 14, 2016

Cynthia Joyce talked about a book on Hurricane Katrina and its use of archived news coverage of the storm.

#dtmh2016 @cynthiajoyce talks about Hurricane Katrina and a book of the curated experiences of those who covered the storm pic.twitter.com/hw0u0yuMAh
— Shawn M. Jones (@shawnmjones) October 14, 2016

Jennifer Younger presented the work being done by the Catholic News Archive.

#dtmh2016 Jennifer Younger presents https://t.co/1Aw1jWHrM4 pic.twitter.com/m2TNPc0J4E
— Shawn M. Jones (@shawnmjones) October 14, 2016

Kalev Leetaru talked about the work he and the gdeltproject are doing in web archival.

@kalevleetaru giving lighting talk about @gdeltproject #dtmh2016 pic.twitter.com/pKAfMchdKW
— John Berlin (@johnaberlin) October 14, 2016

An overview of the loss of journalistic content. #dtmh2016 pic.twitter.com/HKc6OU8Muy
— P. Kim Bui (@kimbui) October 14, 2016

The last presentation of the event was by Kate Zwaard, titled "Technology and community: Why we need partners, collaborators, and friends".

Kate Zwaard talked about the success of web archival events such as the recent Collections as Data and Archives Unleashed 2.0 held at the Library of Congress.

#dtmh2016 @kzwa mentioned the success of #archivesunleashed @librarycongress earlier this year pic.twitter.com/YkjY0IoHEJ
— Shawn M. Jones (@shawnmjones) October 14, 2016

The web archive collection at the Library of Congress.

#dtmh2016 @kzwa talking about web archive @librarycongress https://t.co/oNRWXpLv7Q, which is #memento compliant pic.twitter.com/yE09UWNY2B
— Shawn M. Jones (@shawnmjones) October 14, 2016

How they are putting Jupyter notebooks on top of database dumps.

#dtmh2016 @kzwa talks about saving #Jupyter notebooks and database dumps https://t.co/k4cZJues1Q pic.twitter.com/LQa2DPE0C4
— Shawn M. Jones (@shawnmjones) October 14, 2016

And the diverse skill sets required for librarians of today.

"It's like physicists in the '50s." @kzwa of @librarycongress talks about wide range of skill sets necessary for librarians #dtmh2016 pic.twitter.com/chM22UCuAJ
— JDNA (@RJIJDNA) October 14, 2016

The final breakout sessions of dtmh2016 consisted of four topic discussions.

Jefferson Bailey's session, Web Archiving For News, was an informal breakout where he asked the attendees about collaboration between the Archive and other organizations. A notable response came from the NYTimes representative Evan Sandhaus, who asked in return whether organizations or archives should be responsible for the preservation of news content. Jefferson Bailey responded that he wished organizations were more active in practicing self-archiving. Others described the approaches to self-archiving taken by their own organizations or by ones they knew about.

Ben Welsh's session, News Apps, discussed the issues in archiving news apps, which are online web applications providing rich data experiences. An example app used to illustrate this was California's War Dead, which was archived by the Internet Archive but with diminished functionality. In spite of this "success", Ben Welsh brought up the difficulty of preserving the full experience of the app, as web crawlers only interact with client-side code and not the server-side code that the app requires. To address this issue, he suggested solutions such as the Python library django-bakery for producing flat, static versions of news apps based on database queries. These static versions can be archived more easily while still providing a fuller experience when replayed.
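To make the idea of a flat, static version concrete, here is a minimal sketch using only the Python standard library. It is not the API of the library mentioned above: the `stories` table, the `bake` function, and the page template are all hypothetical names chosen for illustration. Each database row is rendered once to an HTML file, and an index page links to every file so that a crawler can discover them without talking to any server-side code.

```python
import sqlite3
from html import escape
from pathlib import Path

# Hypothetical page template; a real news app would use its own templates.
PAGE = ("<html><head><title>{title}</title></head>"
        "<body><h1>{title}</h1><p>{body}</p></body></html>")


def bake(conn, out_dir):
    """Render every row of the (hypothetical) stories table to a flat HTML file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    query = "SELECT slug, title, body FROM stories ORDER BY slug"
    for slug, title, body in conn.execute(query):
        path = out_dir / f"{slug}.html"
        html = PAGE.format(title=escape(title), body=escape(body))
        path.write_text(html, encoding="utf-8")
        written.append(path)
    # An index page with plain links lets a crawler reach every baked page.
    links = "".join(f'<li><a href="{p.name}">{p.stem}</a></li>' for p in written)
    index = f"<html><body><ul>{links}</ul></body></html>"
    (out_dir / "index.html").write_text(index, encoding="utf-8")
    return written


def demo_connection():
    """Build a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stories (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
    conn.executemany("INSERT INTO stories VALUES (?, ?, ?)", [
        ("first", "First story", "Text of the <first> story."),
        ("second", "Second story", "Text of the second story."),
    ])
    return conn
```

Calling `bake(demo_connection(), "build")` would write `first.html`, `second.html`, and `index.html` into a `build` directory, which can then be served as plain files and crawled like any static site.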

Ben Welsh @palewire focusing on how his news app for @latimes is made at #dtmh2016 at @UCLA_library ssavenews pic.twitter.com/5Rk71tc5B9
— Edward McCain (@e_mccain) October 14, 2016

Eric Weig's session, Working with CMS, started out with him sharing his experience of migrating the CMS of one of the University of Kentucky Libraries Special Collections Research Center newspaper sites from a local data center using sixteen CPUs to a less powerful cloud-based solution using only two CPUs. One of the biggest performance increases came when he switched from dynamically generating pages to serving static HTML pages. Generating the static HTML pages for the eighty-two thousand issues contained in this CMS took only three hours on the two-CPU cloud-based solution. After he shared this experience, the rest of the time was used to hear from the audience about their experiences with CMSs and for an impromptu roundtable discussion on the topic.

Kalev Leetaru's session, The GDELT Project: A Look Inside The World's Largest Initiative To Understand And Archive The World's News, was a more in-depth version of the lightning talk he gave. Kalev Leetaru shared experiences that The GDELT Project had with archival crawling of non-English language news sites, his work with the Internet Archive on monitoring news feeds and broadcasts, the untapped opportunities for exploration of the Internet Archive, and A Vision Of The Role and Future Of Web Archives. He also shared two questions he is currently pondering: "Why are archives checking certain news organizations more than others?" and "How do we preserve GeoIP generated content especially in non-western news sites?".

@kalevleetaru: datasets @gdeltproject uses. #dtmh2016 pic.twitter.com/bgJPWFsHlp
— John Berlin (@johnaberlin) October 14, 2016

The last speaker of dtmh2016 was Katherine Skinner with Alignment and Reciprocity. In her speech Katherine Skinner called for volunteers to carry out some of the actions mentioned at dtmh2016 and reflected on the past two days.

Katherine Skinner from @Educopia talks about Alignment and Reciprocity at #dtmh2016 @UCLA_library #savenews pic.twitter.com/iOM3UUKQHF
— Edward McCain (@e_mccain) October 14, 2016

Closing out dtmh2016 was Edward McCain, who thanked everyone for coming and said how enjoyable the event had been, especially with the graduate students taking part. Todd Grappone then gave the closing remarks, reminding attendees of the pressing problems in news archival and how they require both academic and software solutions.

I'm sad about the end of #dtmh2016; it was good to meet everyone; lots of good experiences pic.twitter.com/bQX6DUXxrx
— Shawn M. Jones (@shawnmjones) October 14, 2016

Video recordings of DTMH2016 can be found on the Reynolds Journalism Institute's Facebook page. Chris Aldrich recorded audio, along with a transcription, of days one and two. NPR's Research, Archive & Data Strategy team created a Storify page of tweets covering topics they found interesting.

-- John Berlin

Monday, October 3, 2016

2016-10-03: Which States and Topics did the Two Presidential Candidates Mention?

"Team Turtle" in Archive Unleashed in Washington DC
(from left to right: N. Chah, S. Marti, M. Aturban , and I. Amin)

The first presidential debate (H. Clinton v. D. Trump) took place last Monday, September 26, 2016, at Hofstra University, New York. The questions were about topics like the economy, taxes, jobs, and race. During the debate, the candidates mentioned those topics (and other issues) and, in many cases, associated a topic with a particular place or US state (e.g., shootings in Chicago, Illinois, and the crime rate in New York). This reminded me of the work that we had done in the second Archives Unleashed Hackathon, held at the Library of Congress in Washington DC. I worked with "Team Turtle" (Niel Chah, Steve Marti, Mohamed Aturban, and Imaduddin Amin) on analyzing an archived collection, provided by the Library of Congress, about the 2004 Presidential Election (G. Bush v. J. Kerry). The collection contained hundreds of archived web sites in ARC format. These key web sites were maintained by the candidates or their political parties (e.g., www.georgewbush.com, www.johnkerry.com, www.gop.com, and www.democrats.org) or by newspapers like www.washingtonpost.com and www.nytimes.com. They were crawled on the days around election day (November 2, 2004). The goal of this project was to investigate "How many times did each candidate mention each state?" and "What topics were they talking about?"

In this event, we had limited time (two days) to finish our project and present findings by the end of the second day. Fortunately, we were able to make it through three main steps: (1) extract plain text from ARC files, (2) apply some techniques to extract named entities and topics, and (3) build a visualization tool to better show the results. Our processing scripts are available on GitHub.

[1] Extract textual data from ARC files

The ARC file format specifies a way to store multiple digital resources in a single file. It is used heavily by the web archive community to store captured web pages (e.g., the Internet Archive's Heritrix writes what it finds on the Web in ARC files of 100MB each). ARC is the predecessor of the now more popular WARC format. We were provided with 145 ARC files, and each of these files contained hundreds of web pages. To read the content of these ARC files, we decided to use Warcbase, an interesting open-source platform for managing web archives. We started by installing Warcbase by following these instructions. Then, we wrote several Apache Spark scripts in Scala to iterate over all ARC files and generate a clean textual version (e.g., by removing all HTML tags). For each archived web page, we extracted its unique ID, crawl date, domain name, full URI, and textual content as shown below (we hid the content of web pages due to copyright issues). Results were collected into a single TSV file.
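The shape of this step can be sketched in a few lines of Python. This is not the Warcbase/Scala code we ran at the hackathon; it is a standard-library illustration that assumes the archived records have already been read into `(record_id, crawl_date, uri, html)` tuples, a layout invented here for the example. It strips the HTML tags, derives the domain name from the URI, and writes one tab-separated row per page.

```python
import csv
import io
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML page with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def to_tsv(records):
    """records: iterable of (record_id, crawl_date, uri, html) tuples (assumed layout)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for record_id, crawl_date, uri, html in records:
        domain = urlsplit(uri).netloc
        writer.writerow([record_id, crawl_date, domain, uri, html_to_text(html)])
    return out.getvalue()
```

Collapsing whitespace before writing matters here: a stray tab or newline inside the page text would otherwise break the one-row-per-page TSV layout.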

[2] Extract named entities and topics

We used the Stanford Named Entity Recognizer (NER) to tag people and places, while for topic modeling, we used the following techniques:

NLTK to tokenize text
Stemming and removing stop words (involving TF-IDF weighting)
Gensim and Latent Dirichlet Allocation for topic modeling
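As a rough illustration of the preprocessing part of this list, the sketch below tokenizes text, drops stop words, and computes TF-IDF weights using only the Python standard library. It deliberately stands in for the real tools: a regular expression replaces the NLTK tokenizer, the stop-word list is a tiny made-up sample, and stemming and the LDA topic model itself are left out.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "we", "will"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights
```

Terms that appear in every document get a weight of zero, which is the point of the weighting: words shared by all pages say little about the topic of any one page.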

After applying the above techniques, the results were aggregated in a text file, which was then used as input to the visualization tool (described in step [3]). Part of the results is shown in the table below.

State	Candidate	Frequency of mentioning the state	The most important topic
Mississippi	Kerry	85	Iraq
Mississippi	Bush	131	Energy
Oklahoma	Kerry	65	Jobs
Oklahoma	Bush	85	Retirement
Delaware	Kerry	53	Colleges
Delaware	Bush	2	Other
Minnesota	Kerry	155	Jobs
Minnesota	Bush	303	Colleges
Illinois	Kerry	86	Iraq
Illinois	Bush	131	Health
Georgia	Kerry	101	Energy
Georgia	Bush	388	Tax
Arkansas	Kerry	66	Iraq
Arkansas	Bush	42	Colleges
New Mexico	Kerry	157	Jobs
New Mexico	Bush	384	Tax
Indiana	Kerry	132	Tax
Indiana	Bush	43	Colleges
Maryland	Kerry	94	Jobs
Maryland	Bush	213	Energy
Louisiana	Kerry	60	Iraq
Louisiana	Bush	262	Tax
Texas	Kerry	195	Terrorism
Texas	Bush	1108	Tax
Tennessee	Kerry	69	Tax
Tennessee	Bush	134	Teacher
Arizona	Kerry	77	Iraq
Arizona	Bush	369	Jobs

...
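The aggregation behind a table like the one above could be sketched as follows. The key assumption, which is ours for this illustration, is that a page is attributed to a candidate by the domain it was crawled from; the list of states is a short sample, and the input is assumed to be the location entities already tagged for each page.

```python
from collections import Counter

# Assumption: pages are attributed to a candidate by their domain.
DOMAIN_TO_CANDIDATE = {
    "www.georgewbush.com": "Bush",
    "www.gop.com": "Bush",
    "www.johnkerry.com": "Kerry",
    "www.democrats.org": "Kerry",
}

# Sample subset of state names; a full run would list all states.
STATES = {"Texas", "Ohio", "Florida", "New Mexico"}


def count_state_mentions(pages):
    """pages: iterable of (domain, [location entities]) pairs.

    Returns a Counter keyed by (state, candidate).
    """
    counts = Counter()
    for domain, locations in pages:
        candidate = DOMAIN_TO_CANDIDATE.get(domain)
        if candidate is None:
            continue  # skip pages that cannot be attributed to a candidate
        for location in locations:
            if location in STATES:
                counts[(location, candidate)] += 1
    return counts
```

Pages from domains outside the mapping, such as newspaper sites, are skipped in this sketch rather than guessed at.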

[3] Interactive US map

We decided to build an interactive US map using D3.js. As shown below, the state color indicates the winning party (i.e., red for Republican and blue for Democratic) while the size of the bubbles indicates how many times the state was mentioned by the candidate. The visualization required us to provide some information manually, such as the winning party for each state. In addition, we supplied latitude and longitude coordinates to position the bubbles on the map (two circles for each state). Hovering over a bubble shows the most important topic mentioned by the candidate. To interact with the map, visit http://www.cs.odu.edu/~maturban/hackathon/.
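The data preparation for such a map can be sketched in Python, independent of the D3.js code that draws it. Everything structural here is an assumption made for illustration: the field names, the square-root scaling of the radius (so that bubble area, not radius, grows with the count), the longitude offset that separates the two circles of a state, and the approximate coordinates. The frequencies and topics in the usage note come from the table in step [2].

```python
import json
import math

WINNER_COLOR = {"Republican": "red", "Democratic": "blue"}


def bubbles(rows, state_info, max_radius=40.0, offset=0.8):
    """Turn (state, candidate, frequency, topic) rows into bubble records.

    state_info maps a state to its manually supplied latitude, longitude,
    and winning party. Each candidate's bubble is shifted east or west so
    the two circles of a state do not sit on top of each other.
    """
    max_freq = max(freq for _, _, freq, _ in rows)
    side = {"Kerry": -offset, "Bush": offset}
    records = []
    for state, candidate, freq, topic in rows:
        info = state_info[state]
        records.append({
            "state": state,
            "candidate": candidate,
            "lat": info["lat"],
            "lon": info["lon"] + side[candidate],
            # Square-root scaling keeps bubble area proportional to frequency.
            "radius": round(max_radius * math.sqrt(freq / max_freq), 2),
            "state_fill": WINNER_COLOR[info["winner"]],
            "topic": topic,  # shown when hovering over the bubble
        })
    return records


# Approximate, illustrative coordinates; winners are the 2004 results.
STATE_INFO = {
    "Texas": {"lat": 31.0, "lon": -99.0, "winner": "Republican"},
    "Delaware": {"lat": 39.0, "lon": -75.5, "winner": "Democratic"},
}
```

For example, `json.dumps(bubbles([("Texas", "Kerry", 195, "Terrorism"), ("Texas", "Bush", 1108, "Tax")], STATE_INFO))` produces a JSON array that a map script could load, with the largest count mapped to the maximum radius.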

So exciting to see use of our LC web archives in use at the #hackarchives ! So inspiring! pic.twitter.com/DXksowhrn2
— Abbie Grotke (@agrotke) June 15, 2016

Looking at the map might help us answer the research questions, but it also raises others, such as why Republicans did not talk about topics related to states like North Dakota, South Dakota, and Utah. Is it because these are always considered "red" states? On the other hand, it is clear that they paid more attention to "swing" states like Colorado and Florida. Finally, it seems useful to revisit this topic now, as we are close to the 2016 presidential election (H. Clinton v. D. Trump), and the same analysis could be applied again to see what newspapers say about this event.

--Mohamed Aturban
